Table of Contents
The workflow, using the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, shows how to find outliers based on the distances measured between the data. This can be done either by measuring their distance from their k nearest neighbours, or by checking whether their distance from some data object is above a given threshold. The definition of an outlier is relative, it can always be defined in comparison with the distances between the data objects. Thus if the distances between the data objects are basically great, a high threshold has to be set for outliers.
Wisconsin Diagnostic Breast Cancer (WDBC) [UCI MLR]
It can be seen that outliers can be filtered out using the appropriate settings.
As for example even differences that range in the hundreds occur between the individual values of the represented
area, thus this result can be obtained by setting the threshold for the
Euclidean distance to the value 500.
Note that due to the existing great distances between the data objects, the number of outliers detected will only decrease to its true level if the threshold is incremented up to 500, and under a certain value, way too many data objects would be identified as outliers.