Cluster evaluation

Description

This process uses the Aggregation dataset to show how cluster evaluation metrics can be collected and displayed.

Input

Aggregation [SIPU Datasets] [Aggregation]

The dataset contains 788 two-dimensional vectors that form 7 separate groups. The aim here is to evaluate the clusters created from this data.

Figure 12.11. The 788 vectors


Output

Figure 12.12. The evaluating subprocess



After the data is read, agglomerative clustering is run with different parameter settings, and the resulting hierarchy is flattened into the requested number of clusters. A similarity measure is computed from the data so that cluster density can be measured, and the measurements are saved for each parameter setting.
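
As a rough illustration of the steps above, the following Python sketch runs a naive agglomerative clustering on a handful of toy points, stops when k clusters remain, and scores the result. It is not the RapidMiner implementation: the function names are hypothetical, the toy points are not taken from the Aggregation dataset, and the density score (the size-weighted mean of the average pairwise similarity inside each cluster, with similarity taken as negative Euclidean distance) is only an assumption about what such a measure might look like.

```python
import math
from itertools import combinations


def linkage_distance(a, b, points, strategy):
    """Distance between two clusters a and b, given as lists of point indices."""
    d = [math.dist(points[i], points[j]) for i in a for j in b]
    if strategy == "single":      # closest pair of points
        return min(d)
    if strategy == "complete":    # farthest pair of points
        return max(d)
    if strategy == "average":     # mean over all cross-cluster pairs
        return sum(d) / len(d)
    raise ValueError(f"unknown strategy: {strategy}")


def agglomerate(points, k, strategy):
    """Start from singletons and merge the two closest clusters until k remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: linkage_distance(
                clusters[ab[0]], clusters[ab[1]], points, strategy
            ),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters


def cluster_density(clusters, points):
    """Assumed density score: size-weighted mean of the average pairwise
    similarity (negative Euclidean distance) within each cluster."""
    total = 0.0
    for c in clusters:
        pairs = list(combinations(c, 2))
        if pairs:  # singleton clusters contribute 0
            avg = sum(-math.dist(points[i], points[j]) for i, j in pairs) / len(pairs)
            total += avg * len(c)
    return total / len(points)


# Toy data: two well-separated groups of three points each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for strategy in ("single", "complete", "average"):
    found = agglomerate(toy_points, 2, strategy)
    print(strategy, found, cluster_density(found, toy_points))
```

The merge loop is cubic or worse in the number of points, which is acceptable for a toy example but far too slow for serious use. A library implementation would normally be preferred for the full 788 vectors.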

Figure 12.13. Setting up the parameters



In total, 60 different settings are tested: the number of clusters ranges from 2 to 20, and each of the three agglomeration strategies of agglomerative clustering is tried.
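
The parameter grid can be pictured as the Cartesian product of the cluster counts and the strategies. The sketch below is a minimal Python analogue of such a loop, not the Loop Parameters operator itself. The strategy labels and dictionary keys are illustrative assumptions. Enumerating every integer k from 2 to 20 gives 19 values and therefore 57 combinations, so the grid shown in Figure 12.13 presumably differs slightly from this sketch.

```python
from itertools import product

# Illustrative labels for the three agglomeration strategies (assumed names).
STRATEGIES = ("single", "complete", "average")

# Every integer number of clusters from 2 to 20 inclusive (assumed step of 1).
K_VALUES = range(2, 21)

# One dictionary per parameter setting, strategy by strategy.
settings = [{"k": k, "mode": mode} for mode, k in product(STRATEGIES, K_VALUES)]

print(len(settings))   # number of combinations in this sketch
print(settings[0])     # first setting that would be evaluated
```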

Figure 12.14. Parameters to log



The cluster sizes, the cluster densities, the distribution of the points, and the agglomeration strategy are saved for each setting.
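
To make the logged quantities more concrete, the Python sketch below computes one possible item distribution score and writes one row per setting in CSV form. The score used here is the sum of squared cluster shares, which is an assumption about the measure. The helper names and column names are hypothetical, and the density value in the example row is a placeholder rather than a measurement.

```python
import csv
import io
from collections import Counter

FIELDS = ["k", "mode", "density", "item_distribution"]  # assumed column names


def item_distribution(labels):
    """Sum of squared cluster shares (assumed measure).
    Equals 1.0 when every item is in one cluster and 1/k for k equal clusters."""
    n = len(labels)
    return sum((count / n) ** 2 for count in Counter(labels).values())


def make_row(k, mode, density, labels):
    """Collect the values logged for one parameter setting."""
    return {
        "k": k,
        "mode": mode,
        "density": density,
        "item_distribution": item_distribution(labels),
    }


def to_csv(rows):
    """Serialise the logged rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Toy cluster labels for six items; the density -1.0 is a placeholder value.
row = make_row(2, "single", density=-1.0, labels=[0, 0, 0, 1, 1, 1])
print(to_csv([row]))
```

Under this assumed measure, k equal-sized clusters score 1/k, so the value can be expected to fall as the number of clusters grows, while uneven splits score higher than even ones.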

Figure 12.15. Cluster density against the number of clusters k



Figure 12.16. Item distribution against the number of clusters k



The final results are obtained by reading the log.

Interpretation of the results

The results show that, as the number of clusters increases, cluster density increases and point distribution decreases, at different rates for the three strategies. The single link strategy, however, lags somewhat behind the complete link and average link strategies.

Workflow

clust2_exp3.rmp

Keywords

cluster evaluation
agglomerative clustering
single link
complete link
average link
point density
point distribution

Operators

Agglomerative Clustering
Cluster Density Performance
Data to Similarity
Flatten Clustering
Item Distribution Performance
Log
Log to Data
Loop Parameters
Multiply
Read CSV