Using the Flame dataset, this process shows how incidental outliers and anomalies can be found in a dataset with statistics-based unsupervised methods. Several methods are available for unsupervised anomaly detection; one of them is the statistics-based histogram method. Here, a histogram defines groups (bins) of values for each attribute, and a value in a given column is considered an outlier based on how much it deviates from these groups. The overall outlier score of each record is then derived from these per-value scores. The higher the score, the more outlying the value or record is, and the more likely it is to be an anomaly. However, the scale of the scores varies with the dataset and the method used, so the threshold above which an element is considered an outlier should be set according to the distances between the data points and the method used. For the same reason, it can be more illustrative to indicate outlier scores with colors rather than with numbers alone, which the histogram-based method does automatically.
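To make the scoring idea concrete, the following is a minimal Python sketch in the spirit of a histogram-based outlier score with equal-width bins. It is not the implementation used by the RapidMiner operator: the function name `hbos_fixed_width`, the normalization of bar heights to a maximum of 1, the logarithmic score, the synthetic data, and the 99th-percentile threshold are all assumptions made for illustration.

```python
import numpy as np


def hbos_fixed_width(X, n_bins=10):
    """Score records with equal-width histograms, one histogram per column.

    Returns (column_scores, record_scores); higher means more outlying.
    """
    X = np.asarray(X, dtype=float)
    column_scores = np.zeros_like(X)
    for j in range(X.shape[1]):
        col = X[:, j]
        # Equal-width bin edges spanning the observed range of the column.
        edges = np.linspace(col.min(), col.max(), n_bins + 1)
        # The interior edges map each value to a bin index in 0..n_bins-1.
        idx = np.searchsorted(edges[1:-1], col, side="right")
        counts = np.bincount(idx, minlength=n_bins)
        # Scale the bar heights so that the fullest bin has height 1.
        heights = counts / counts.max()
        # A value in a sparsely populated bin receives a high score.
        column_scores[:, j] = np.log(1.0 / heights[idx])
    # The record score is the sum of the per-column scores.
    return column_scores, column_scores.sum(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical data: a dense cloud plus one record far away from it.
    data = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])
    per_column, per_record = hbos_fixed_width(data, n_bins=10)
    # Assumed cut-off: flag the records above the 99th percentile of scores.
    threshold = np.percentile(per_record, 99)
    print(np.flatnonzero(per_record > threshold))
```

The per-column scores play the role of the value-level indicators described above, and their sum is the record-level score. The percentile cut-off is only one possible choice; as noted, a suitable threshold depends on the data and on the method.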
The Anomaly Detection extension, which can be installed in RapidMiner, offers several methods for detecting anomalies. One of them is the histogram-based method, which assigns an outlier score to the individual values in each column of the dataset and calculates the final score of each record from these. The method can be refined with several settings, at operator level as well as at column level:
Depending on the settings, the operator splits the values of each column into either a predefined or an arbitrary number of bins, which are either equal or variable in width. Based on these bins, it assigns color codes and calculates the record-level score from the scores of the column values. With a fixed bin width and an arbitrary number of bins, the following values are returned:
Based on these results, a score can be chosen as the threshold above which an element is considered an anomaly. A more detailed examination is also possible here: the color codes show how likely each attribute value is to be an outlier, and whether it coincides with outlying values in other columns. This makes it possible to refine the model if necessary and, in some cases, makes it easier to decide which values should be considered anomalies. The graphical representation of the model built from the scores shows that some slightly outlying values have not been assigned a high score:
Based on this, it may be advisable to alter the model, for example by splitting the attributes into dynamically sized bins. This improves outlier detection, as the following results show:
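Separately from the RapidMiner output, the idea of dynamically sized bins can be sketched in Python. The sketch assumes that "dynamic" means bins holding roughly the same number of values (quantile-based edges), so that the bar height becomes a density, that is, the count divided by the bin width. The function name `hbos_dynamic_width` and the synthetic data are hypothetical, and the operator's actual binning rule may differ.

```python
import numpy as np


def hbos_dynamic_width(X, n_bins=10):
    """Score records with variable-width (equal-frequency) histograms.

    Returns (column_scores, record_scores); higher means more outlying.
    """
    X = np.asarray(X, dtype=float)
    column_scores = np.zeros_like(X)
    for j in range(X.shape[1]):
        col = X[:, j]
        # Quantile-based edges: each bin holds roughly the same number of values.
        # Duplicate edges (caused by repeated values) are merged.
        edges = np.unique(np.quantile(col, np.linspace(0.0, 1.0, n_bins + 1)))
        if edges.size < 2:
            continue  # constant column: every value keeps the score 0
        idx = np.searchsorted(edges[1:-1], col, side="right")
        counts = np.bincount(idx, minlength=edges.size - 1)
        # With unequal widths the bar height is a density: count / width.
        density = counts / np.diff(edges)
        heights = density / density.max()
        # Values in wide, thinly populated bins receive a high score.
        column_scores[:, j] = np.log(1.0 / heights[idx])
    return column_scores, column_scores.sum(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical data: a dense cloud plus one record far away from it.
    data = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])
    _, per_record = hbos_dynamic_width(data, n_bins=10)
    # Print the indices of the five highest-scoring records.
    print(np.argsort(per_record)[-5:])
```

Because a distant value stretches the bin it falls into, the values that share this wide bin also receive elevated scores in that column. Summing over several columns is what separates records that are outlying in more than one attribute from records that merely sit at the edge of one attribute.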
|statistics based anomaly detection|
|histogram based anomaly detection|