Unsupervised statistics-based anomaly detection

Description

Using the Flame dataset, this process shows how incidental outliers (anomalies) can be found in a dataset with unsupervised, statistics-based methods. Several methods are available for unsupervised anomaly detection; one of them is the statistics-based histogram method. Here, a histogram defines groups (bins) of values for each attribute, and a value in a given column is considered an outlier according to how far it deviates from these groups. The per-value scores are then combined into an overall outlier score for each record. The higher this score, the more of an outlier the value or record is, and the more likely it is to be an anomaly. The scores, however, vary with the dataset and the method used, so the threshold above which an element counts as an outlier should be set with regard to the distances between the data points and the method applied. For the same reason, it can be more illustrative to show the outlier scores with colors rather than with numbers alone, which the histogram-based method does automatically.
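The scoring idea described above can be sketched in a few lines of Python. This is a minimal illustration of one common formulation of a histogram-based outlier score (equal-width bins, histogram heights normalised so the tallest bin is 1, and the logarithm of the inverse height summed over the columns). It is not the code of the RapidMiner operator, and the function names `equal_width_bin_index` and `hbos_scores` as well as the `n_bins` parameter are assumptions made for this example.

```python
import math


def equal_width_bin_index(value, lo, hi, n_bins):
    """Return the index of the equal-width bin of [lo, hi] that holds value."""
    if hi == lo:
        return 0  # constant column: everything falls into a single bin
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(idx, n_bins - 1)  # the column maximum belongs to the last bin


def hbos_scores(rows, n_bins=10):
    """Illustrative histogram-based outlier scores (assumed formulation).

    Returns (per_value, totals): one score per value in each column, and
    the sum of these scores for every record.
    """
    n_cols = len(rows[0])
    per_value = [[0.0] * n_cols for _ in rows]
    for j in range(n_cols):
        column = [row[j] for row in rows]
        lo, hi = min(column), max(column)
        bins = [equal_width_bin_index(v, lo, hi, n_bins) for v in column]
        counts = [0] * n_bins
        for b in bins:
            counts[b] += 1
        tallest = max(counts)
        for i, b in enumerate(bins):
            # Relative height is counts[b] / tallest; a rare bin gives a
            # small height and therefore a large score.
            per_value[i][j] = math.log(tallest / counts[b])
    totals = [sum(scores) for scores in per_value]
    return per_value, totals


# Made-up sample: four nearby points and one point far away from them.
sample = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.0]]
per_value, totals = hbos_scores(sample, n_bins=5)
for row, total in zip(sample, totals):
    print(row, round(total, 3))
```

In this made-up sample the last record sits alone in its bin in both columns, so it receives the largest total score, while the four clustered records share the tallest bin and score zero.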

Input

Flame [SIPU Datasets] [Flame]

Output

The Anomaly Detection extension, which can be installed in RapidMiner, offers several methods for detecting anomalies. One of them is the histogram-based method, which computes an outlier score for each individual value in each column of the dataset and derives the final score of each record from these. The method can be tuned with several settings, both at the operator level and at the column level:

Figure 13.10. Global settings for Histogram-based Outlier Score

Figure 13.11. Column-level settings for Histogram-based Outlier Score

Depending on the settings, the operator splits the values of each column into either a predefined or an arbitrary number of bins, which are either equal or variable in width. From these bins it assigns color codes and calculates the record-level score from the scores of the column values. With a fixed bin width and an arbitrary number of bins, the following values are returned:

Figure 13.12. Scores and attribute binning for fixed binwidth and arbitrary number of bins

Interpretation of the results

The results can be used to decide which score to take as the threshold above which an element is considered an anomaly. In this case a more detailed examination is also possible: the color codes show how likely the individual attributes are to contain outliers, and whether these coincide with outlier values in other columns. On the one hand, this allows the model to be refined if necessary; on the other hand, it can make it easier to decide which values should be considered anomalies. The graphic representation of the scores shows that there are slightly outlying values that have not been assigned a high score:

Figure 13.13. Graphic representation of outlier scores
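Choosing the threshold mentioned above is a judgment call that depends on the data. One simple option, shown here purely as an assumption and not as something the operator does, is to flag a fixed fraction of the highest-scoring records. The function name `flag_top_fraction` and the default `fraction` value are hypothetical.

```python
def flag_top_fraction(scores, fraction=0.05):
    """Flag the highest-scoring records as potential anomalies.

    `fraction` is an assumed, tunable cut-off, not a recommended value.
    Records tied with the threshold score are flagged as well.
    """
    if not scores:
        return []
    n_flagged = max(1, round(len(scores) * fraction))
    threshold = sorted(scores, reverse=True)[n_flagged - 1]
    return [score >= threshold for score in scores]


# Made-up record-level outlier scores.
scores = [0.0, 0.1, 0.0, 2.77, 0.2]
print(flag_top_fraction(scores, fraction=0.2))
```

A rank-based cut-off like this only fixes how many records are reported; whether those records are real anomalies still has to be checked against the plot and the per-column color codes.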

Based on this, it may be advisable to alter the model, for example by splitting the attributes into dynamically sized bins. This improves outlier detection, as the following results show:

Figure 13.14. Scores and attribute binning for dynamic binwidth and arbitrary number of bins

Figure 13.15. Graphic representation of the enhanced outlier scores
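To make the idea of dynamically sized bins more concrete, the following sketch scores a single column with bins that each hold roughly the same number of values, so that sparse regions end up in wide, low-density bins. This is a simplified illustration under stated assumptions, not the operator's implementation: the function name `dynamic_width_scores`, the `values_per_bin` parameter and the `min_width` floor for zero-width bins are all hypothetical.

```python
import math


def dynamic_width_scores(column, values_per_bin, min_width=1e-9):
    """Score one column using equal-frequency (dynamic-width) bins.

    Assumed simplification: each bin spans from its smallest to its
    largest value, and its density is count / width. `min_width` keeps
    bins made only of tied values from causing a division by zero; such
    bins still get a very high density, which is a known limitation of
    this sketch. Use values_per_bin >= 2.
    """
    order = sorted(range(len(column)), key=lambda i: column[i])
    chunks = [order[k:k + values_per_bin]
              for k in range(0, len(order), values_per_bin)]
    # Merge a short trailing chunk into its neighbour so that no bin
    # ends up with too few values.
    if len(chunks) > 1 and len(chunks[-1]) < values_per_bin:
        chunks[-2].extend(chunks.pop())
    densities = []
    for chunk in chunks:
        width = column[chunk[-1]] - column[chunk[0]]
        densities.append(len(chunk) / max(width, min_width))
    tallest = max(densities)
    scores = [0.0] * len(column)
    for chunk, density in zip(chunks, densities):
        for i in chunk:
            # Low density relative to the densest bin gives a high score.
            scores[i] = math.log(tallest / density)
    return scores


# Made-up column: six closely spaced values and two isolated ones.
column = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 4.0, 9.0]
for value, score in zip(column, dynamic_width_scores(column, values_per_bin=2)):
    print(value, round(score, 3))
```

In this made-up column the two isolated values share one wide bin and get a clearly higher score than the closely spaced values, whose bins are narrow and dense.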

Workflow

anomaly_exp3.rmp

Keywords

outliers
anomaly detection
statistics based anomaly detection
histogram based anomaly detection
bin size

Operators

Read CSV
Histogram-based Outlier Score (HBOS) [Anomaly Detection]