K-means method

Description

This process demonstrates the K-means clustering algorithm on the Aggregation dataset and illustrates how strongly the result depends on the choice of the algorithm's parameters. The algorithm is fitted with the `Cluster` operator.

Input

Aggregation [SIPU Datasets] [Aggregation]

The dataset consists of `788` two-dimensional vectors that form `7` separate groups. The task is to discover these groups, which are called clusters. The difficulty lies in the arrangement of the points: the point clouds differ in size and are separated by different distances. The dataset is visualized with the `Graph Explore` operator.

Output

After reading the dataset, we drag and drop the `Cluster` operator into the process and use the following settings. The user-specified option is chosen for the number of clusters and, based on the scatterplot above, the number of clusters is set to `7`.

The result can also be displayed by the `Graph Explore` operator. One can see that the algorithm found the upper and right clusters, but it performs poorly on the lower points.
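The iteration behind such a result can be sketched in a few lines. The following is a minimal, standard-library-only illustration of the textbook K-means (Lloyd) iteration, not the implementation used by the `Cluster` operator. The function name `kmeans`, the random choice of initial centers, and the toy data are assumptions made for the example; the toy points only stand in for the Aggregation dataset.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd iteration on a list of (x, y) tuples."""
    rng = random.Random(seed)
    # Assumption: the initial centers are k distinct points drawn at random.
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                # An empty cluster keeps its previous center.
                new_centers.append(centers[j])
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, labels


# Toy data (hypothetical): three well-separated blobs of 30 points each.
rng = random.Random(1)
toy_points = [
    (cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
    for cx, cy in [(0, 0), (10, 0), (5, 10)]
    for _ in range(30)
]
toy_centers, toy_labels = kmeans(toy_points, k=3)
```

Because the result depends on the initial centers, different seeds can lead to different partitions, which is one reason the initialization settings matter.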

Let's try a different parameter setting in which the initial cluster centers are chosen by the MacQueen method and the minimum distance between the cluster centers is set to `9`.

One can see that the result is more accurate with this parameter setting; a larger error remains only in the lower-left set of points. Correcting it would require a considerably more sophisticated clustering method.
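The effect of a minimum distance between initial centers can be illustrated with a small sketch. This is an assumption about the general idea only and is not claimed to reproduce how the `Cluster` operator implements its initialization: points are scanned in order, and a point becomes an initial center only if it lies at least `min_dist` away from every center accepted so far. The names `pick_seeds`, `min_dist`, and the demo points are hypothetical.

```python
import math


def pick_seeds(points, k, min_dist):
    """Scan points in order; accept a point as an initial center only if it
    is at least min_dist away from every center accepted so far."""
    seeds = []
    for p in points:
        if all(math.dist(p, s) >= min_dist for s in seeds):
            seeds.append(p)
        if len(seeds) == k:
            break
    return seeds


# Hypothetical demo points: two tight groups plus two isolated points.
demo_points = [(0, 0), (1, 1), (2, 0), (12, 0), (13, 1), (0, 15), (30, 30)]
demo_seeds = pick_seeds(demo_points, k=3, min_dist=9)
# demo_seeds is [(0, 0), (12, 0), (0, 15)]: near-duplicates of an
# already accepted center are skipped.
```

Forcing the initial centers apart makes it less likely that two of them start inside the same cloud of points.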

Finally, let us look at what happens if we ask for a slightly larger number of clusters, say `8`. In this case, the only change is that the algorithm now finds the two small clusters in the bottom left, at the cost of cutting the large cluster beside them into three parts and splitting the upper right cluster.

Interpretation of the results

Based on the experiment, we can see that even a very simple clustering algorithm such as the K-means method is able to extract simple relationships. Moreover, if the parameters of the algorithm are chosen well, the accuracy of the results can be increased. In addition, the `Cluster` operator provides several visualization functions that help to evaluate the results.

After selecting the Result menu, the main windows shown in the figure above appear and give a summary of the results. The top left window shows the cluster (segment) plot as a function of the input attributes. The bottom left window shows a pie chart of the cluster sizes. The cluster statistics can be read in the top right window, and the output listing is shown in the bottom right. Each window can be enlarged individually. Among the many other tools available, we want to point out the following two.

The figure above shows the centers of the clusters along with the overall average of the attributes. Finally, the figure below shows a decision tree built from the clusters: the resulting cluster variable is treated as the target variable of a classification task, which is solved by fitting a decision tree.
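The quantities behind the pie chart and the cluster-center view (cluster sizes, per-cluster attribute means, and the overall attribute mean) are simple to compute from labelled points. The sketch below shows one way to do so outside the tool; the function name `summarize` and the sample data are hypothetical and only serve as an illustration.

```python
from collections import defaultdict


def summarize(points, labels):
    """Return (sizes, centers, overall_mean) for labelled 2-D points."""
    groups = defaultdict(list)
    for p, lab in zip(points, labels):
        groups[lab].append(p)
    # Cluster sizes: the numbers a pie chart of clusters would display.
    sizes = {lab: len(members) for lab, members in groups.items()}
    # Cluster centers: the mean of each attribute within a cluster.
    centers = {
        lab: tuple(sum(c) / len(members) for c in zip(*members))
        for lab, members in groups.items()
    }
    # Overall mean of each attribute, for comparison with the centers.
    overall_mean = tuple(sum(c) / len(points) for c in zip(*points))
    return sizes, centers, overall_mean


# Hypothetical labelled sample: two clusters of sizes 2 and 3.
sample_points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (12.0, 10.0), (11.0, 13.0)]
sample_labels = [0, 0, 1, 1, 1]
sizes, centers, overall_mean = summarize(sample_points, sample_labels)
```

Comparing each center with the overall mean shows which attributes distinguish a cluster from the data as a whole.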

Workflow

`sas_clust_exp1.xml`

Keywords

K-means, unsupervised learning, clustering

Operators

Cluster, Data Source, Graph Explore