This process uses a sample of the Individual household electric power consumption dataset to show how incidentally occurring outliers (anomalies) can be found in a dataset with unsupervised methods. Several methods are available for unsupervised anomaly detection. In the general case, one option is the family of methods based on the k nearest neighbours, which assign an outlier score to each element based on its distance from its k nearest neighbours. The higher this score, the more of an outlier the element is, and the more likely it is to be a potential anomaly. However, the scale of the scores depends on the dataset and the method used, so the threshold above which an element is considered an outlier should be set according to the distances between the data points and the method applied.
Individual household electric power consumption [UCI MLR]
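The distance-based scoring described above can be illustrated with a minimal sketch in plain Python. This is not the implementation used by the RapidMiner extension; the function name `knn_outlier_scores`, the choice of the mean distance to the k nearest neighbours as the score, and the sample points are all assumptions made for illustration only.

```python
import math

def knn_outlier_scores(points, k):
    """Return, for each point, the mean Euclidean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point (the point itself is excluded).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical two-feature records; the last one lies far from the others.
sample = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (10.0, 10.0)]
scores = knn_outlier_scores(sample, k=2)
```

In this toy sample the isolated point receives by far the largest score, while the four clustered points share the same small score. On real data the attributes would normally be brought to a comparable scale first, since the score depends directly on the distances.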
The Anomaly Detection extension, which can be installed in RapidMiner, offers several methods for detecting anomalies, for example the method based on the k nearest neighbours and the LOF (Local Outlier Factor) metric, which builds on the k nearest neighbours method but also takes local density into consideration.
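How LOF combines neighbour distances with density can be sketched as follows. This is a simplified, plain-Python illustration and not the extension's implementation: the function name `lof_scores` and the sample points are hypothetical, ties at the k-th distance are ignored, and the data are assumed to contain no duplicate points.

```python
import math

def lof_scores(points, k):
    """Simplified Local Outlier Factor for a small list of points."""
    n = len(points)
    # For each point, keep its k nearest neighbours as (distance, index) pairs.
    neigh = []
    for i in range(n):
        d = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        neigh.append(d[:k])
    # k-distance: distance to the k-th nearest neighbour.
    k_dist = [nb[-1][0] for nb in neigh]
    # Local reachability density: inverse of the mean reachability distance,
    # where reach(i, j) = max(k_dist[j], dist(i, j)).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neigh[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean density of the neighbours divided by the point's own density.
    return [sum(lrd[j] for _, j in neigh[i]) / (k * lrd[i]) for i in range(n)]

# Hypothetical two-feature records; the last one lies far from the others.
sample = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (10.0, 10.0)]
lof = lof_scores(sample, k=2)
```

A score close to 1 means the point is about as densely surrounded as its neighbours; values well above 1 indicate a point in a sparser region than its neighbours, which is what makes it a candidate outlier.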
These methods assign different scores to the elements, from which it can be seen which elements are outliers. The k nearest neighbours method assigns the following scores to the elements of the dataset:
The LOF method assigns the following scores to the elements of the dataset:
Based on the resulting scores, a threshold can be chosen above which an element is considered an anomaly. The elements with scores above this threshold, i.e. the outliers, can then be filtered out of the dataset immediately, or collected into a separate dataset.
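As a rough illustration of this thresholding step, the sketch below splits a set of rows into inliers and outliers by score. The function name `split_by_threshold`, the row labels, the scores and the threshold value are all hypothetical; in practice the threshold would be chosen after inspecting the scores produced for the actual dataset.

```python
def split_by_threshold(rows, scores, threshold):
    """Split rows into (inliers, outliers); a row is an outlier if its score exceeds the threshold."""
    inliers = [r for r, s in zip(rows, scores) if s <= threshold]
    outliers = [r for r, s in zip(rows, scores) if s > threshold]
    return inliers, outliers

# Hypothetical rows and outlier scores, for illustration only.
rows = ["a", "b", "c", "d"]
row_scores = [0.9, 1.1, 4.7, 1.0]
inliers, outliers = split_by_threshold(rows, row_scores, threshold=2.0)
```

Keeping both parts corresponds to the two options mentioned above: the inliers form the cleaned dataset, and the outliers form a separate dataset for further inspection.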
For example, using the k-NN method, the following dataset remains after removing the elements rated as outliers:
Furthermore, the set of elements rated as outliers based on the LOF score is the following: