Chapter 22. Clustering 1

Standard methods

Table of Contents

K-means method
Agglomerative hierarchical methods
Comparison of clustering methods

K-means method


The process demonstrates the K-means clustering algorithm on the Aggregation dataset and illustrates how strongly the result depends on the choice of its parameters. The algorithm can be fitted using the Cluster operator.


Aggregation dataset [SIPU Datasets]

The dataset consists of 788 two-dimensional vectors that form 7 separate groups; the task is to discover these groups, which are called clusters. The difficulty lies in the arrangement of the points: smaller and larger clouds of points are present, separated by varying distances. The visualization is produced by the Graph Explore operator.
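The Aggregation data itself is distributed as a plain-text file in the SIPU collection, so a self-contained sketch has to substitute for it. The snippet below builds a synthetic stand-in with scikit-learn's make_blobs: 788 two-dimensional points in 7 groups of unequal size (the group sizes used here are hypothetical, not those of the real dataset).

```python
import numpy as np
from sklearn.datasets import make_blobs

# Synthetic stand-in for the Aggregation dataset: 788 two-dimensional
# points in 7 groups of unequal size. The sizes below are hypothetical
# and only mimic the mix of smaller and larger clouds.
sizes = [273, 170, 130, 102, 45, 34, 34]
X, group = make_blobs(n_samples=sizes, n_features=2,
                      cluster_std=1.0, random_state=0)

print(X.shape)            # (788, 2)
print(np.unique(group))   # [0 1 2 3 4 5 6]
```

Any plotting library (e.g. matplotlib) can then reproduce a scatterplot like the one produced by Graph Explore.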

Figure 22.1. The Aggregation dataset.



After reading the dataset, we drag and drop the Cluster operator and apply the following settings: the user-specified cluster number option is selected and, based on the scatterplot above, the number of clusters is set to 7.
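Outside the operator, the same fit can be sketched with scikit-learn's KMeans (an assumption: the Cluster operator's internal defaults may differ), again on synthetic stand-in data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in data with 7 groups; the real Aggregation file would be
# loaded from disk instead.
X, _ = make_blobs(n_samples=788, centers=7, random_state=0)

# K-means with a user-specified number of clusters, K = 7.
km = KMeans(n_clusters=7, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_.shape)  # (7, 2)
```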

Figure 22.2. The setting of the Cluster operator.


The result can also be displayed with the Graph Explore operator. One can see that the algorithm found the upper and right-hand clusters, but it performs poorly on the points below.

Figure 22.3. The result of K-means clustering when K=7


Let us try a different parameter setting, where the initial cluster centers are chosen by the MacQueen method and the minimum distance between the cluster centers is set to 9.
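MacQueen's original algorithm updates a center incrementally every time a point is assigned to it, rather than recomputing all means once per pass. The sketch below shows only this online update rule on synthetic data; it is not the operator's exact initialization procedure, and the minimum-distance threshold of 9 is not reproduced.

```python
import numpy as np

def macqueen_pass(X, k, seed=0):
    """One MacQueen-style pass over the data: seed the centers with the
    first k points, then assign each remaining point to its nearest
    center and move that center to the running mean of its members."""
    rng = np.random.default_rng(seed)
    X = X[rng.permutation(len(X))]
    centers = X[:k].astype(float).copy()
    counts = np.ones(k)
    for x in X[k:]:
        j = int(np.argmin(((centers - x) ** 2).sum(axis=1)))
        counts[j] += 1
        centers[j] += (x - centers[j]) / counts[j]  # incremental mean update
    return centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(60, 2))
               for c in [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]])
centers = macqueen_pass(X, k=3)
print(centers.shape)  # (3, 2)
```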

Figure 22.4. The setting of the MacQueen clustering


One can see that the result is more accurate with this parameter setting; a larger error remains only at the lower-left set of points. Correcting it would require a considerably more sophisticated clustering method.

Figure 22.5. The result of the MacQueen clustering


Finally, let us see what happens if we request a slightly larger number of clusters, say 8. In this case the algorithm does find the two small clusters in the bottom left, but it cuts the large cluster beside them into three parts and also splits the upper-right cluster.
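The sensitivity to the number of clusters can also be judged numerically. One common approach (not a feature taken from the operator) is to compare the silhouette score across candidate values of K, sketched here on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=7, cluster_std=0.8, random_state=0)

# Silhouette score (in [-1, 1], higher is better) for each candidate K.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

On clouds as intertwined as those of Aggregation, however, such scores can be misleading, which is why the visual check above remains useful.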

Figure 22.6. The result of the clustering with 8 clusters


Interpretation of the results

Based on the experiment, we can see that even the simplest clustering algorithms, such as the K-means method, are able to extract simple relationships. Moreover, if the parameters of the algorithm are chosen well, the accuracy of the results can be increased. In addition, the Cluster operator provides several visualization functions that help to evaluate the results.

Figure 22.7. The result display of the Cluster operator


After selecting the Results menu, the main window shown above appears with a summary of the results. At the top left, the cluster (segment) plot is shown as a function of the input attributes. At the bottom left, a pie chart shows the sizes of the clusters. At the top right, the cluster statistics can be read, while at the bottom right the output listing is shown. Each of these windows can be enlarged individually. Among the many other tools available, we point out the following two.
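The pie chart and the statistics window correspond to simple per-cluster aggregates. A minimal pandas sketch of the same summaries, on synthetic stand-in data (an assumption; the operator computes these internally):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
df = pd.DataFrame(X, columns=["x1", "x2"])
df["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Cluster sizes: the data behind the pie chart.
sizes = df["cluster"].value_counts().sort_index()

# Per-cluster mean and standard deviation: the cluster statistics table.
stats = df.groupby("cluster")[["x1", "x2"]].agg(["mean", "std"])
```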

Figure 22.8. Scatterplot of the cluster means


The figure above shows the centers of the clusters together with the overall means of the attributes. Finally, the figure below shows the decision tree constructed from the clusters, which is obtained by using the resulting cluster variable as a classification target and solving the classification task by fitting a decision tree.
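This cluster-profiling step can be sketched with scikit-learn (an assumption: the operator's tree settings may differ): fit K-means, then fit a decision tree that predicts the cluster labels from the input attributes.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

X, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.8, random_state=0)

# Use the cluster labels as the target of a classification task.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, labels)

# How well the tree's simple rules reproduce the clustering.
acc = tree.score(X, labels)
```

The tree's splits then give readable rules describing which attribute ranges characterize each cluster.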

Figure 22.9. The decision tree of the clustering





