Computational intelligence algorithms for anomaly detection problems in big data environments

  1. Carrasco Castillo, Jacinto
Supervised by:
  1. Francisco Herrera Triguero Co-director
  2. Julián Luengo Martín Co-director

Defence university: Universidad de Granada

Date of defence: 10 March 2023

Committee:
  1. Sebastián Ventura Soto Chair
  2. Óscar Cordón García Secretary
  3. María José del Jesús Díaz Committee member

Type: Thesis

Abstract

The proliferation of computer systems in all kinds of fields, whether medical, industrial, economic or scientific, has brought with it the generation of ever-increasing volumes of data. This has created the need for new technologies to store and analyse these data, as well as new circumstances in which the aim is to extract knowledge from them. One of the usual scenarios is anomaly detection, where the interest lies in identifying a minority class of data: either because it may pose a threat to the system under study, as in fraud detection or predictive maintenance of industrial systems, or in medical environments, where there are few samples from patients with a disease compared to the common healthy population and the aim is to detect that disease. This focus on the minority class differentiates anomaly detection from noise detection, where noise is defined as an effect on the data that we want to mitigate in the pre-processing phase but whose cause is not relevant to the investigation. We can therefore identify different scenarios within anomaly detection depending on the information available when the algorithm learns: supervised scenarios, comparable to imbalanced classification problems; semi-supervised or novelty detection scenarios, where a normality model is built from the majority-class data, the only data available in the training phase; and unsupervised scenarios, where no class information is available for the instances. These differences lead to different evaluation methods and to the need for additional mechanisms to extract interpretable knowledge in scenarios where the representation learned by the model is insufficient to understand the problem.
In this thesis we focus on the anomaly detection problem in unsupervised scenarios, both for time series and for static data. The study starts from the demarcation of the problem within the anomaly detection domain, moves on to the design of a distributed anomaly detection algorithm, valid for both static and time series data, focused on producing explanations that aid decision making and the understanding of the dataset under study, and finally proposes an evaluation model for unsupervised time series anomaly detection scenarios. Specifically, the proposals made in the framework of the thesis are the following.

A distributed anomaly detection model focused on explainability. This model builds on the HBOS algorithm, which assigns anomaly scores from univariate histograms, and extends it to search for anomalies in higher-dimensional subspaces. Using this algorithm as a basis is justified by the possibility of constructing a knowledge representation that, by reusing certain computations, allows histograms of higher-dimensional subspaces to be reconstructed in later phases. Furthermore, this knowledge representation lets us include a proposal for building rules that describe, through counterfactuals, why specific instances are categorised as they are, that is, rules that justify why an instance belongs to one class and not to another. The experiments associated with this proposal show that its results do not match the state of the art in anomaly detection, the lower performance being the price paid for the simplicity of the model that makes rule extraction possible.

A model for evaluating anomaly detection algorithms for time series. In the field of anomaly detection there are multiple evaluation schemes.
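The univariate-histogram scoring that HBOS performs, and which the first proposal extends, can be sketched minimally as follows. This is a generic single-machine illustration of the HBOS idea only, not the distributed subspace extension or rule-construction mechanism proposed in the thesis; the function name and parameters are illustrative.

```python
import numpy as np

def hbos_scores(X, n_bins=10):
    """HBOS-style anomaly scoring (minimal sketch).

    One histogram is built per feature; an instance's anomaly score
    is the sum over features of the negative log density of the bin
    it falls into, so points in sparsely populated bins score higher.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        counts, edges = np.histogram(X[:, j], bins=n_bins)
        density = np.clip(counts / counts.sum(), 1e-12, None)  # avoid log(0)
        # map each value back to its histogram bin index
        bins = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, n_bins - 1)
        scores += -np.log(density[bins])
    return scores
```

Because each feature is summarised independently, the per-feature histograms can be built in parallel on partitions of the data and merged, which is what makes this family of methods attractive for distributed environments.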
In particular, in time series scenarios it is common to apply models that predict an anomaly score for each time instant, while the events of interest occur after the anomalous predictions. However, these methods pose problems such as the need to set evaluation parameters, for example the definition of a window prior to the event of interest or weights that reward fast detection, and the amplification of the effect of class imbalance. We therefore propose a scoring mechanism based on the definition of multiple windows prior to the events of interest and on a ROC curve generalised over those windows, such that the anomaly score of an interval is the aggregation of its instances by a given function. This proposal includes an implementation for classical environments and another for distributed environments, together with a comparison against a related interval-based evaluation measure for anomaly detection, in which we show not only the usefulness of our measure for evaluation in the described scenarios but also its computational efficiency compared with that alternative. The proposals made provide solutions to specific problems in anomaly detection research, such as the lack of models capable of working in distributed environments while offering explanations of why an instance is classified as anomalous or normal, and the mismatch of evaluation systems that score individual instances when the events to be evaluated occur over a period of time.
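The general idea of scoring intervals before events rather than individual instants can be illustrated with a toy sketch. This is an assumption-laden simplification, not the thesis's measure: here each pre-event window is a positive interval, the remaining instants are chopped into negative intervals of the same size, each interval is summarised by its maximum score, and a rank-based ROC AUC is averaged over several window sizes.

```python
import numpy as np

def windowed_auc(scores, event_times, window_sizes=(5, 10, 20)):
    """Toy window-based evaluation for time series anomaly detection.

    For each window size w, the w instants preceding each event form a
    positive interval; the rest of the series is split into negative
    intervals of size w. Intervals are aggregated by their maximum
    score and compared with a rank-based (Mann-Whitney) ROC AUC; the
    mean AUC over window sizes is returned.
    """
    scores = np.asarray(scores, dtype=float)
    aucs = []
    for w in window_sizes:
        mask = np.zeros(len(scores), dtype=bool)
        pos = []
        for t in event_times:
            lo = max(0, t - w)
            pos.append(scores[lo:t].max())  # aggregate the pre-event window
            mask[lo:t] = True
        rest = np.where(~mask)[0]
        neg = [scores[rest[i:i + w]].max() for i in range(0, len(rest), w)]
        pos, neg = np.asarray(pos), np.asarray(neg)
        # Mann-Whitney formulation of the ROC AUC, ties counted as 0.5
        auc = ((pos[:, None] > neg[None, :]).mean()
               + 0.5 * (pos[:, None] == neg[None, :]).mean())
        aucs.append(auc)
    return float(np.mean(aucs))
```

A detector that spikes shortly before each event then obtains an AUC near 1 regardless of the exact window size chosen, which is precisely the parameter sensitivity that evaluating over multiple windows is meant to reduce.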