This process demonstrates the K-means clustering algorithm on the Aggregation dataset and illustrates how strongly the choice
of its various parameters affects the result. This clustering algorithm
can be fitted using the Cluster operator.
The dataset consists of
788 two-dimensional vectors, which form
7 separate groups.
The task is to discover these groups, which are called clusters. The difficulty of the task lies in the arrangement of the
points: smaller and larger clouds of points are present, separated by varying distances in space.
The visualization is done by the
Graph Explore operator.
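The Graph Explore operator is part of a graphical tool, but the same inspection can be sketched in code. The snippet below is a minimal stand-in, assuming scikit-learn is available; since the Aggregation file itself is not part of this document, a synthetic point cloud with the same size and group count is generated in its place.

```python
# Hypothetical stand-in for loading and inspecting the Aggregation data:
# 788 two-dimensional vectors forming 7 groups, synthesized here with
# make_blobs rather than read from the real file.
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=788, centers=7, n_features=2, random_state=0)
print(X.shape)      # 788 points, 2 coordinates each
print(len(set(y)))  # 7 distinct groups
```

A scatterplot of `X` colored by `y` would reproduce the view that Graph Explore provides.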
After reading the dataset, we drag and drop the
Cluster operator and apply the following settings.
The user-specified cluster number option is chosen and, based on the scatterplot above, the number of clusters is set to 7.
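Fitting K-means with a user-specified cluster count can be sketched as follows. This is a hedged scikit-learn analogue of the Cluster operator, run on the synthetic stand-in data rather than the real Aggregation file.

```python
# Minimal K-means fit with a user-specified number of clusters (7),
# using scikit-learn as a stand-in for the graphical Cluster operator.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic stand-in for the 788-point Aggregation dataset
X, _ = make_blobs(n_samples=788, centers=7, n_features=2, random_state=0)

km = KMeans(n_clusters=7, n_init=10, random_state=0).fit(X)
print(km.labels_.shape)           # one cluster label per point
print(km.cluster_centers_.shape)  # 7 two-dimensional cluster centers
```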
The result can also be displayed by the
Graph Explore operator. One can see that the algorithm
found the upper and right-hand clusters, but it performs poorly on the points below.
Let us try a different parameter setting, where the initial cluster centers are chosen by the MacQueen method
and a minimum value for the distances between the cluster centers is prescribed.
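The effect of the initialization choice can be sketched in code. Note an assumption here: scikit-learn does not offer MacQueen's seeding directly, so the sketch compares its built-in `"k-means++"` and `"random"` initializations instead, as the closest available analogues.

```python
# Comparing initialization schemes for K-means. MacQueen's method is
# not built into scikit-learn, so 'k-means++' and 'random' seeding are
# used here as substitutes; lower inertia (within-cluster sum of
# squares) indicates a tighter clustering.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=788, centers=7, n_features=2, random_state=0)

results = {}
for init in ("k-means++", "random"):
    km = KMeans(n_clusters=7, init=init, n_init=10, random_state=0).fit(X)
    results[init] = km.inertia_
print(results)
```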
One can see that the result is more accurate with this parameter setting; a larger error remains only for the bottom-left set of points. This could only be corrected by a considerably more advanced clustering method.
Finally, let us look at what happens if we take a slightly larger number of clusters to be produced, which is
8. In this case, the change is that the algorithm now finds the two small
clusters in the bottom left, but it cuts the major cluster beside them into three parts and splits the upper cluster as well.
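The trade-off in raising the cluster count can be sketched numerically. A caveat worth stating: inertia always decreases as the number of clusters grows, so a lower value for k = 8 does not by itself mean a better match to the 7 true groups.

```python
# Effect of the cluster count: fit K-means with k = 7 and k = 8 on the
# synthetic stand-in data and compare the within-cluster sum of squares.
# Inertia decreases monotonically with k, so it cannot alone justify
# choosing the larger cluster count.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=788, centers=7, n_features=2, random_state=0)

inertia = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
           for k in (7, 8)}
print(inertia)
```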
Based on these experiments, we can see that even the simplest clustering algorithm, such as the K-means method, is able
to extract simple relationships. Moreover, if the parameters of the algorithm are chosen well, the
accuracy of the results can be increased. In addition, the
Cluster operator provides
several visualization functions that help to evaluate the results.
After selecting the Results menu, the main window shown in the figure above appears with the results summary. At the top left, the cluster (segment) plot is shown as a function of the input attributes. At the bottom left, a pie chart shows the sizes of the clusters. At the top right, the cluster statistics can be read, while at the bottom right the output list is shown. These windows can be enlarged individually. Among the many other tools available, we point out the following two.
The figure above shows the centers of the clusters along with the overall averages of the attributes. Finally, the figure below shows the decision tree constructed from the clusters, which is obtained by using the resulting cluster variable as a classification target variable and solving the classification task by fitting a decision tree.
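The cluster-profiling tree described above can be sketched as follows. This is a hedged scikit-learn analogue of the operator's built-in tree view, run on the synthetic stand-in data: the K-means label becomes the classification target, and a shallow decision tree is fitted to describe the clusters.

```python
# Profile K-means clusters with a decision tree: the cluster label is
# treated as a classification target, and the tree's axis-parallel
# splits give an interpretable description of each segment.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

X, _ = make_blobs(n_samples=788, centers=7, n_features=2, random_state=0)
labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(X)

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, labels)
# training accuracy shows how well simple splits describe the clusters
print(round(tree.score(X, labels), 2))
```

A high score means the clusters can be characterized by a few simple attribute thresholds, which is exactly what the tree view in the figure conveys.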