Agglomerative hierarchical methods

Description

This process uses the Maximum Variance (R15) dataset to show how agglomerative hierarchical clustering algorithms work. These algorithms can be run with the Cluster operator.

Input

Maximum Variance (R15) [SIPU Datasets] [Maximum Variance]

The dataset contains 600 two-dimensional vectors, which are concentrated into 15 clusters. The clusters are arranged around a center at the coordinates (10,10), and the spacing between them increases with distance from the center. This is what makes the task difficult: the clusters near the center come close to blending into each other.

Figure 22.10. Scatterplot of the Maximum Variance (R15) dataset

Output

First, the average linkage hierarchical method is applied. With this method, the distance between two clusters is computed as the average of the pairwise distances between their elements. The results are shown in Figure 22.11.
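The average linkage rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used by the Cluster operator: the function names `average_linkage` and `agglomerate` and the `toy_points` data are hypothetical, Euclidean distance is assumed, and the naive merge loop is far slower than what a production tool would use.

```python
from itertools import combinations
from math import dist


def average_linkage(cluster_a, cluster_b):
    """Mean Euclidean distance over all pairs (a, b), a in A and b in B."""
    total = sum(dist(a, b) for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(points, n_clusters):
    """Naive agglomerative clustering: merge the two closest clusters
    (by average linkage) until only n_clusters remain."""
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest average-linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        # i < j always holds, so deleting index j leaves index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy data (hypothetical, NOT the R15 dataset): three well-separated groups.
toy_points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (20.0, 0.0), (20.0, 1.0), (21.0, 0.0),
]
toy_clusters = agglomerate(toy_points, 3)
for cluster in toy_clusters:
    print(sorted(cluster))
```

On the toy data the loop stops with three clusters of three points each. On R15 the same procedure would be stopped at 15 clusters.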

Figure 22.11. The result of the average linkage hierarchical clustering

The quality of the clustering can be assessed by plotting the original Class attribute against the Segment attribute, which contains the cluster membership obtained from the clustering, in a 3D bar chart. Apart from a permutation of the labels, there is a one-to-one correspondence between the rows and the columns, with the exception of two records.
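The same comparison can be made numerically with a contingency table of Class against Segment. The sketch below is an assumption-laden illustration: the helper names `contingency` and `count_mismatches` and the toy label lists are hypothetical, and the check is simplified in that it only counts records lying outside the dominant segment of their class, without verifying that two classes do not share the same dominant segment.

```python
from collections import Counter


def contingency(class_labels, segment_labels):
    """Count how many records fall into each (class, segment) pair."""
    return Counter(zip(class_labels, segment_labels))


def count_mismatches(class_labels, segment_labels):
    """Number of records outside the dominant segment of their class."""
    table = contingency(class_labels, segment_labels)
    dominant = {}  # class -> (segment, count) of its largest cell
    for (cls, seg), n in table.items():
        if n > dominant.get(cls, (None, 0))[1]:
            dominant[cls] = (seg, n)
    return sum(n for (cls, seg), n in table.items() if dominant[cls][0] != seg)


# Toy labels (hypothetical, NOT results from the R15 experiment).
# Segment ids are an arbitrary permutation of the class ids.
toy_classes = [1, 1, 1, 2, 2, 2, 3, 3, 3]
toy_segments = [7, 7, 7, 5, 5, 9, 9, 9, 9]

toy_table = contingency(toy_classes, toy_segments)
toy_errors = count_mismatches(toy_classes, toy_segments)
print(dict(toy_table))
print(toy_errors)
```

In the toy example a single record of class 2 ends up in the segment that is dominant for class 3, so one mismatch is reported.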

Figure 22.12. Evaluation of the clustering by a 3D bar chart

Another hierarchical clustering method is the Ward method. Applying it yields the results shown in Figure 22.13.
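For orientation, the Ward method is commonly defined as merging, at each step, the pair of clusters whose union gives the smallest increase in the within-cluster sum of squares. The sketch below assumes this standard definition and Euclidean distance; the function names `centroid`, `sse` and `ward_cost` are hypothetical and do not correspond to any SAS procedure.

```python
def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def sse(cluster):
    """Within-cluster sum of squared distances to the centroid."""
    c = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in cluster)


def ward_cost(cluster_a, cluster_b):
    """Increase in the within-cluster sum of squares caused by merging
    A and B: |A||B| / (|A| + |B|) * ||centroid(A) - centroid(B)||^2."""
    ca, cb = centroid(cluster_a), centroid(cluster_b)
    squared_gap = sum((x - y) ** 2 for x, y in zip(ca, cb))
    na, nb = len(cluster_a), len(cluster_b)
    return na * nb / (na + nb) * squared_gap


# Toy clusters (hypothetical, NOT taken from the R15 dataset).
a = [(0.0, 0.0), (2.0, 0.0)]
b = [(4.0, 0.0)]

# The closed-form cost agrees with the directly computed SSE increase.
direct_increase = sse(a + b) - sse(a) - sse(b)
print(ward_cost(a, b), direct_increase)
```

Unlike average linkage, this criterion depends on cluster sizes as well as on distances, which is one reason the two methods can produce different merges on the same data.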

Figure 22.13. The result of Ward clustering

Interpretation of the results

The process demonstrated that when the number of possible clusters is relatively large, it is worth choosing one of the automatic clustering procedures. In SAS® Enterprise Miner™, hierarchical clustering is available for this purpose in several variants. The experiment also shows that the choice of agglomerative method does not always affect the resulting clusters. SAS proposes the number of clusters based on the CCC (cubic clustering criterion) plot, shown in the figure below.

Figure 22.14. CCC plot of automatic clustering

In addition, a schematic display of the cluster locations, the so-called proximity diagram, is obtained. It closely resembles the scatterplot of the clusters obtained earlier.

Figure 22.15. Proximity graph of the automatic clustering

Workflow

sas_clust_exp2.xml

Keywords

hierarchical methods
average linkage
Ward method
CCC graph
clustering

Operators

Cluster
Data Source
Graph Explore