The process demonstrates how to cluster the attributes by using the operator when there are a large number of attributes in the dataset. The process uses the Spambase [UCI MLR] dataset. After the attributes are clustered, further supervised data mining methods can be applied; for example, we may classify the e-mails into spam and non-spam classes.

The dataset contains 4601 records and 58 attributes. The records are classified into 2 groups by the Class variable, which identifies the spam e-mails: its value equals 1 if the record is spam and 0 otherwise.
The challenge in the dataset is that the relatively large number of attributes slows down the training process. The experiment shows that, after a suitable clustering of the attributes, a model can be obtained that is competitive with models fitted on the whole set of attributes.
During attribute clustering, the columns of the dataset are clustered by a hierarchical method to reduce the dimension of the dataset. The most important parameter of the operator is Maximum Cluster, which adjusts the maximal number of clusters. Similar parameters are the maximal number of eigenvalues and the explained variance. You can also choose between the correlation and the covariance matrix in the analysis. One of the most important results is the dendrogram, which visualizes the process of the hierarchical clustering.
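As a minimal sketch of this hierarchical step (the operator's internals are not spelled out here, so the distance measure, the synthetic data, and all names below are illustrative assumptions): the columns can be clustered on a distance derived from their pairwise correlations, with the maximal number of clusters playing the role of the Maximum Cluster parameter.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
n = 500
# Two latent factors, each driving two noisy attribute copies,
# stand in for the correlated attribute groups of the real data.
f1, f2 = rng.normal(size=(2, n))
X = np.column_stack([f1 + 0.1 * rng.normal(size=n),
                     f1 + 0.1 * rng.normal(size=n),
                     f2 + 0.1 * rng.normal(size=n),
                     f2 + 0.1 * rng.normal(size=n)])

# Distance between attributes: 1 - |correlation|, so strongly
# correlated columns are close to each other.
corr = np.corrcoef(X, rowvar=False)
dist = 1.0 - np.abs(corr)
np.fill_diagonal(dist, 0.0)

# Hierarchical clustering of the columns; t caps the number of
# clusters, analogous to the Maximum Cluster parameter.
Z = linkage(squareform(dist, checks=False), method="average")
membership = fcluster(Z, t=2, criterion="maxclust")
print(membership)
```

The linkage object `Z` is exactly what a dendrogram plot (e.g. `scipy.cluster.hierarchy.dendrogram`) would visualize.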
The relationship between the original attributes and the obtained clusters is depicted in the following graph.
The cluster memberships, i.e., the sets of attributes belonging to each cluster, can be seen in the following figure.
In creating the clusters, the correlation (or covariance) between the original attributes plays the most important role: attributes that are highly correlated with each other end up in the same cluster. This is shown in the following figure.
It can also be investigated how strongly each original variable correlates with the new cluster variables. The following figure shows the correlation bar chart of the variable representing the dollar special character.
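The text does not say how the new cluster variables are constructed; one common choice, sketched below under that assumption, is to take the first principal component of each cluster's columns and then measure each original attribute's correlation with its own cluster variable. The data and the cluster labels are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
# Two groups of correlated attributes, as in the clustering step.
f1, f2 = rng.normal(size=(2, n))
X = np.column_stack([f1 + 0.1 * rng.normal(size=n),
                     f1 + 0.1 * rng.normal(size=n),
                     f2 + 0.1 * rng.normal(size=n),
                     f2 + 0.1 * rng.normal(size=n)])
labels = np.array([1, 1, 2, 2])  # assumed membership from the clustering

def cluster_representatives(X, labels):
    """One common choice of cluster variable: the first principal
    component of the (centered) columns in each cluster."""
    reps = []
    for c in sorted(set(labels)):
        cols = X[:, labels == c]
        cols = cols - cols.mean(axis=0)
        # The first right singular vector gives the PC weights.
        _, _, vt = np.linalg.svd(cols, full_matrices=False)
        reps.append(cols @ vt[0])
    return np.column_stack(reps)

R = cluster_representatives(X, labels)
# Each attribute should correlate strongly with its own cluster
# variable -- this is what the correlation bar chart visualizes.
r = abs(np.corrcoef(X[:, 0], R[:, 0])[0, 1])
print(round(r, 2))
```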
After the attribute clustering, an SVM model was fitted to the binary Class variable by using the 19 new cluster attributes. The results obtained in this way were then compared with a similar model fitted directly to the original 58 attributes.
The results below show that the models obtained have similar performance. The classification bar charts show similar classification matrices.
In some places, the response curve behaves better on the clustered attributes than on the original ones.
If the cumulative lift functions are compared against the baseline and the best possible lift functions, similar behavior can be seen.
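For reference, cumulative lift at depth d is the positive rate among the top fraction d of records (ranked by model score) divided by the overall base rate; the baseline is constantly 1. A small illustrative implementation, with made-up scores:

```python
import numpy as np

def cumulative_lift(y_true, scores, depths):
    """Cumulative lift: among the top fraction d of records ranked
    by score, the ratio of the positive rate to the base rate."""
    order = np.argsort(-scores)          # best-scored records first
    y = np.asarray(y_true)[order]
    base = y.mean()
    out = []
    for d in depths:
        k = max(1, int(round(d * len(y))))
        out.append(float(y[:k].mean() / base))
    return out

# A perfect ranker on a 50% base rate doubles the lift at depth 0.5
# and, like any ranker, has lift 1 at depth 1.0.
y = np.array([1, 1, 1, 0, 0, 0])
s = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.1])
print(cumulative_lift(y, s, [0.5, 1.0]))  # → [2.0, 1.0]
```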
Finally, the ROC curves are very similar to each other.
If a supervised data mining model has very many input attributes, which makes training slow, then it is worth reducing the dimension by clustering the input attributes. The explanatory power of the resulting model is usually not much worse than that of a model fitted to the original attributes.