Text clustering


This process shows how documents can be clustered, using the Twenty Newsgroups dataset as an example.


The input is a subset of the Twenty Newsgroups dataset [UCI MLR].


The dataset was donated to the UCI Machine Learning Repository by Tom Mitchell.

The full dataset contains about 20,000 newsgroup articles spread over 20 topics. The subset used here covers only three of those topics: cars, electronics, and everyday politics.


Figure 12.20. The preprocessing subprocess


The documents are read topic by topic, converted to lower case, tokenized, and stemmed, and then stop words are filtered out. Each document is represented by a TF-IDF vector, so the only remaining step is to cluster these vectors.
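The preprocessing chain described above can be sketched outside RapidMiner in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reproduction of the operators: `crude_stem` is a hypothetical stand-in for the Snowball stemmer, `STOP_WORDS` is a deliberately tiny list, and the TF-IDF weighting (relative term frequency times the logarithm of inverse document frequency, scaled to unit length) is one common variant that may differ from the one RapidMiner applies.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list (assumption); a real list is far longer.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def crude_stem(token):
    """Strip a few common English suffixes (a stand-in for Snowball, not a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lower-case, tokenize, stem, then filter stop words, in the order given in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(t) for t in tokens]
    return [s for s in stems if s not in STOP_WORDS]


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, scaled to unit length."""
    tokenized = [preprocess(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = {
            t: (c / len(tokens)) * math.log(n_docs / doc_freq[t])
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors
```

Calling `tfidf_vectors` on a list of raw strings returns one sparse vector per document. Terms that occur in every document receive a weight of zero, which is the usual effect of the inverse document frequency factor.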

Figure 12.21. The clustering setup


The distance between document vectors is measured with cosine similarity. After clustering, the cluster labels are mapped to class labels, and the result is compared with the true topics to check how well the clusters cover the individual topics.
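The clustering and evaluation steps can be sketched as follows. This is an illustrative sketch under several assumptions: document vectors are dense lists of floats, the initial centroids are chosen by index rather than at random, and clusters are mapped to labels by a simple majority vote, which may differ from the strategy used by the Map Clustering on Labels operator. All function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def kmeans_cosine(vectors, init_indices, iterations=10):
    """k-means variant that assigns each vector to its most cosine-similar centroid."""
    centroids = [list(vectors[i]) for i in init_indices]
    assignment = []
    for _ in range(iterations):
        assignment = [
            max(range(len(centroids)), key=lambda c: cosine_similarity(v, centroids[c]))
            for v in vectors
        ]
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def map_clusters_to_labels(assignment, labels):
    """Map each cluster id to the most frequent true label among its members."""
    mapping = {}
    for c in set(assignment):
        member_labels = [lab for a, lab in zip(assignment, labels) if a == c]
        mapping[c] = Counter(member_labels).most_common(1)[0][0]
    return [mapping[a] for a in assignment]


def confusion_matrix(true_labels, predicted_labels):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(true_labels, predicted_labels))
```

With the true topic of each document in `labels`, the call `confusion_matrix(labels, map_clusters_to_labels(kmeans_cosine(vectors, init_indices), labels))` yields the counts from which a table like the one in the figure below can be built. Off-diagonal entries, where the true and predicted labels differ, indicate topics that the clustering failed to separate.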

Figure 12.22. The confusion matrix of the results


Interpretation of the results

The results show that the cars topic is heavily mixed with electronics. This is plausible, since the two fields have a great deal in common.





K-means method
cosine similarity
text clustering
text mining


Map Clustering on Labels
Performance (Classification)
Filter Stopwords (English) [Text Mining Extension]
Process Documents from Files [Text Mining Extension]
Stem (Snowball) [Text Mining Extension]
Tokenize [Text Mining Extension]
Transform Cases [Text Mining Extension]