Unsupervised search for outliers

Description

The process shows, using a sample of the Individual household electric power consumption dataset, how incidentally occurring outliers and anomalies can be found in a dataset with unsupervised methods. Several methods can be used for unsupervised anomaly detection, for example methods based on the k nearest neighbours, which assign an outlier score to each element based on its distance from its k nearest neighbours. The higher this score, the more of an outlier the element is, and the more likely it is to be a potential anomaly. However, the scale of the scores varies with the dataset and the method used, so the threshold above which an element is considered an outlier should be set according to the distances between the data points and the method applied.
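The k nearest neighbours scoring described above can be illustrated outside RapidMiner with a minimal Python sketch. It assumes the score is the mean Euclidean distance to the k nearest neighbours, which is one common variant and not necessarily the exact computation of the extension. The function name `knn_scores` and the toy records are hypothetical.

```python
import math

def knn_scores(points, k):
    """Return, for each point, the mean Euclidean distance to its k nearest neighbours."""
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical toy records: four similar points and one that lies far away.
readings = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2), (5.0, 5.0)]
scores = knn_scores(readings, k=2)
for point, score in zip(readings, scores):
    print(point, round(score, 3))
```

The isolated point receives by far the largest score. The absolute values depend on the units of the attributes, which is why a threshold cannot be fixed without looking at the data.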

Input

Individual household electric power consumption [UCI MLR]
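For readers who want to experiment with the data outside RapidMiner, the following Python sketch reads numeric columns from a file in the layout of the original UCI text file, which is assumed here to be semicolon separated with `?` marking missing values. The sample used in this process may be formatted differently. The function name `read_power_rows` and the three sample rows are made up for illustration.

```python
import csv
import io

def read_power_rows(handle, columns):
    """Yield tuples of floats for the requested columns, skipping incomplete rows."""
    reader = csv.DictReader(handle, delimiter=";")
    for row in reader:
        values = [row[c] for c in columns]
        # Skip rows where any requested value is absent or marked as missing.
        if any(v is None or v.strip() in ("", "?") for v in values):
            continue
        yield tuple(float(v) for v in values)

# Made-up rows in the assumed layout, not taken from the real dataset.
sample = io.StringIO(
    "Date;Time;Global_active_power;Voltage\n"
    "01/01/2007;00:00:00;2.580;241.970\n"
    "01/01/2007;00:01:00;?;?\n"
    "01/01/2007;00:02:00;2.550;241.750\n"
)
rows = list(read_power_rows(sample, ["Global_active_power", "Voltage"]))
print(rows)
```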

Output

The Anomaly Detection extension, which can be installed in RapidMiner, offers several methods for detecting anomalies, for example the method based on the k nearest neighbours, and the LOF (Local Outlier Factor) metric, which builds on the k nearest neighbours but also takes local density into consideration.
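To make the role of density in LOF more concrete, the following Python sketch computes the score in its textbook form: the local reachability density of each point is compared with that of its k nearest neighbours. It is a simplified illustration, not the code of the extension. It takes exactly k neighbours (ties are ignored) and assumes that no two points coincide. The function name `local_outlier_factor` and the toy records are hypothetical.

```python
import math

def local_outlier_factor(points, k):
    """Return the LOF score of each point (values near 1 indicate an inlier)."""
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must be between 1 and len(points) - 1")

    # neighbours[i] holds (distance, index) pairs for the k nearest neighbours of point i.
    neighbours = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbours.append(dists[:k])
    k_distance = [nb[-1][0] for nb in neighbours]

    # Local reachability density: inverse of the mean reachability distance.
    # Assumes distinct points, so the mean is never zero.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], d) for d, j in neighbours[i]]
        lrd.append(k / sum(reach))

    # LOF: mean density of the neighbours divided by the point's own density.
    return [sum(lrd[j] for _, j in neighbours[i]) / (k * lrd[i]) for i in range(n)]

# Hypothetical toy records: four similar points and one that lies far away.
readings = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2), (5.0, 5.0)]
lof = local_outlier_factor(readings, k=2)
for point, score in zip(readings, lof):
    print(point, round(score, 3))
```

Because each score is a ratio of densities, points inside a uniformly dense region score close to 1 regardless of how dense that region is, while the isolated point scores far above 1.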

Figure 13.3. Nearest neighbour based operators in the Anomaly Detection package

Figure 13.4. Settings of LOF.

These methods assign different scores to the elements, and the scores show which elements are outliers. The k nearest neighbours method assigns the following scores to the elements of the dataset:

Figure 13.5. Outlier scores assigned to the individual records based on k nearest neighbours

The LOF method assigns the following scores to the elements of the dataset:

Figure 13.6. Outlier scores assigned to the individual records based on LOF

Interpretation of the results

Based on the results, a score can be chosen as the threshold above which an element is considered an anomaly. The elements with scores above this threshold, i.e. the outliers, can then be filtered out of the dataset immediately, or collected into a separate dataset:

Figure 13.7. Filtering the records based on their outlier scores

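The thresholding step can also be sketched in a few lines of Python. The example below splits a list of records into inliers and outliers by comparing each score with a chosen threshold, mirroring the two outcomes mentioned above (a cleaned dataset and a separate dataset of outliers). The function name `split_by_threshold`, the record labels, the scores and the threshold are all hypothetical.

```python
def split_by_threshold(records, scores, threshold):
    """Partition records into (inliers, outliers) by comparing scores with a threshold."""
    if len(records) != len(scores):
        raise ValueError("records and scores must have the same length")
    inliers = [r for r, s in zip(records, scores) if s <= threshold]
    outliers = [r for r, s in zip(records, scores) if s > threshold]
    return inliers, outliers

# Made-up record labels and outlier scores for illustration.
records = ["r1", "r2", "r3", "r4"]
scores = [0.12, 0.15, 2.40, 0.10]
inliers, outliers = split_by_threshold(records, scores, threshold=1.0)
print("kept:", inliers)
print("outliers:", outliers)
```

A threshold chosen for k-NN scores generally cannot be reused for LOF scores, since the two methods score on different scales.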

For example, using the k-NN method, the following dataset remains after the records rated as outliers are removed:

Figure 13.8. The dataset filtered based on the k-NN score

Furthermore, the set of elements rated as outliers based on the LOF score is the following:

Figure 13.9. The dataset filtered based on the LOF score

Workflow

anomaly_exp2.rmp

Keywords

outliers
anomaly detection
k nearest neighbours

Operators

Filter Examples
Multiply
Read CSV
k-NN Global Anomaly Score [Anomaly Detection]
Local Outlier Factor (LOF) [Anomaly Detection]