Chapter 13. Anomaly detection

Table of Contents

Searching for outliers
Unsupervised search for outliers
Unsupervised statistics based anomaly detection

Searching for outliers

Description

This workflow uses the Wisconsin Diagnostic Breast Cancer (WDBC) dataset to show how outliers can be found from the distances between data objects. An object can be judged either by its distance from its k nearest neighbours, or by whether its distance from other data objects exceeds a given threshold. What counts as an outlier is relative: it is always defined in comparison with the distances that occur between the data objects. Thus, if the distances between the data objects are large in general, a correspondingly high threshold has to be set for outliers.
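The two distance-based criteria described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation used by the workflow's operators: the function names, the `proportion` parameter, and the toy points are all hypothetical, and the threshold rule is assumed to flag an object when most other objects lie farther away than the given distance.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_distance_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(euclidean(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def top_n_outliers(points, k, n):
    """Indices of the n points with the largest k-th neighbour distance."""
    scores = knn_distance_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n])

def threshold_outliers(points, distance, proportion):
    """Indices of points for which at least `proportion` of the other
    points lie farther away than `distance` (assumed rule, see lead-in)."""
    flagged = []
    for i, p in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        far = sum(1 for q in others if euclidean(p, q) > distance)
        if far >= proportion * len(others):
            flagged.append(i)
    return flagged

# Toy data (not WDBC): four nearby points and one distant point.
points = [(1.0, 1.0), (1.5, 1.2), (0.8, 1.1), (1.2, 0.9), (9.0, 9.0)]

by_knn = top_n_outliers(points, k=2, n=1)
by_threshold = threshold_outliers(points, distance=5.0, proportion=0.9)
too_strict = threshold_outliers(points, distance=0.1, proportion=0.9)
print(by_knn, by_threshold, too_strict)
```

With a distance limit that suits the scale of the data, both criteria single out the distant point; with a limit far below the typical distances, every point is flagged, which is the relativity the description refers to.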

Input

Wisconsin Diagnostic Breast Cancer (WDBC) [UCI MLR]

Output

With appropriate settings, the outliers can be filtered out. For example, the values of the plotted attribute area differ from one another by as much as several hundred, so this result is obtained by setting the threshold for the Euclidean distance to 500.
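A short calculation illustrates why an attribute with differences in the hundreds calls for such a large threshold. The two records below are invented for illustration (they are not rows of the WDBC data), and the choice of three attributes is an assumption made only to keep the sketch small.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Invented records in the order (radius, texture, area); not taken from WDBC.
record_a = (12.0, 18.0, 450.0)
record_b = (20.0, 25.0, 1250.0)

full_distance = euclidean(record_a, record_b)
area_difference = abs(record_a[2] - record_b[2])

# Fraction of the squared distance that comes from the area attribute alone.
area_share = area_difference ** 2 / full_distance ** 2

print(round(full_distance, 2), area_difference, round(area_share, 4))
```

In this sketch the area attribute accounts for almost the whole distance, so a threshold of a few units would be exceeded by nearly every pair of records.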

Figure 13.1. Graphic representation of the possible outliers

Interpretation of the results

Note that, because the distances between the data objects are large, the number of detected outliers only falls to a realistic level once the threshold is raised to 500. Below a certain value, far too many data objects are identified as outliers.
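The same trend can be reproduced on toy data by counting the flagged objects for a series of growing distance limits. This is a minimal sketch, not the workflow itself: the points, the limits, the function name `count_outliers`, and the `proportion` parameter are hypothetical, and an object is assumed to be an outlier when at least the given proportion of the other objects lie farther away than the limit.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def count_outliers(points, distance, proportion=0.9):
    """Count points for which at least `proportion` of the other points
    lie farther away than `distance` (assumed rule, see lead-in)."""
    count = 0
    for i, p in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        far = sum(1 for q in others if euclidean(p, q) > distance)
        if far >= proportion * len(others):
            count += 1
    return count

# Toy data (not WDBC): a loose cluster with large internal distances
# and one clearly distant point.
points = [
    (100.0, 200.0), (400.0, 250.0), (250.0, 500.0),
    (300.0, 100.0), (150.0, 400.0), (3000.0, 3000.0),
]

limits = [50, 150, 200, 500]
counts = [count_outliers(points, d) for d in limits]

for d, c in zip(limits, counts):
    print(f"distance limit {d}: {c} outlier(s)")
```

Because the points inside the cluster are themselves a few hundred units apart, small limits flag most of them, and only the largest limit isolates the single distant point.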

Figure 13.2. The number of outliers detected as the distance limit grows

Workflow

anomaly_exp1.rmp

Keywords

outliers
data preprocessing
data cleansing

Operators

Detect Outlier (Densities)
Detect Outlier (Distances)
Multiply
Read AML