Text clustering

Description

This process uses the Twenty Newsgroups dataset to show how documents can be clustered.

Input

A subset of the Twenty Newsgroups dataset [UCI MLR].

Note

The data set was donated to the UCI Machine Learning Repository by Tom Mitchell.

The dataset contains about 20,000 news articles belonging to 20 topics. The subset used here contains only three of these topics: cars, electronics, and everyday politics.

Output

Figure 12.20. The preprocessing subprocess


The documents are read by topic, converted to lower case, tokenized, and stemmed, and then stop words are filtered out. All that remains after this is to cluster the per-document TF-IDF vectors.
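The preprocessing chain described above can be sketched outside RapidMiner in a few lines of Python. This is only an illustration under stated assumptions, not the operators' actual implementation: the stop word list is a tiny hypothetical sample, `crude_stem` is a toy suffix stripper standing in for a real Snowball stemmer, and the TF-IDF weighting (relative term frequency times the logarithm of the inverse document frequency) is one common variant that may differ from the one RapidMiner applies.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list (assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}


def crude_stem(token):
    """Toy suffix stripper; a stand-in for a real Snowball stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lower-case, tokenize, stem, then filter stop words (same order as the process)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(token) for token in tokens]
    return [stem for stem in stems if stem not in STOP_WORDS]


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per document."""
    token_lists = [preprocess(doc) for doc in documents]
    n_docs = len(token_lists)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


if __name__ == "__main__":
    docs = ["Cars and engines", "Circuits and engines"]
    for vector in tfidf_vectors(docs):
        print(vector)
```

With this weighting, a term that occurs in every document receives weight zero, so only the terms that distinguish documents from each other contribute to the clustering.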

Figure 12.21. The clustering setup


The distance between document vectors is measured using cosine similarity. The cluster labels are then mapped to class labels, and the process checks to what extent the clusters cover the individual topics.
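The two evaluation ideas in this step, cosine similarity between sparse document vectors and the mapping of clusters to class labels, can be sketched as follows. The sketch makes several assumptions: vectors are plain dicts of term weights, each cluster is mapped to the majority topic among its members (the Map Clustering on Labels operator may use a different assignment rule), and the cluster ids and topic labels in the example are made up for illustration rather than taken from the experiment.

```python
import math
from collections import Counter, defaultdict


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for empty vectors
    return dot / (norm_u * norm_v)


def map_clusters_to_labels(cluster_ids, true_labels):
    """Map each cluster id to the most frequent true label among its members."""
    members = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        members[cluster][label] += 1
    return {cluster: counts.most_common(1)[0][0]
            for cluster, counts in members.items()}


def confusion_counts(cluster_ids, true_labels):
    """Count (true label, predicted label) pairs after mapping clusters to labels."""
    mapping = map_clusters_to_labels(cluster_ids, true_labels)
    counts = Counter()
    for cluster, label in zip(cluster_ids, true_labels):
        counts[(label, mapping[cluster])] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical cluster assignments and topics, for illustration only.
    clusters = [0, 0, 0, 1, 1, 2]
    topics = ["cars", "cars", "electronics",
              "electronics", "electronics", "politics"]
    for (actual, predicted), n in sorted(confusion_counts(clusters, topics).items()):
        print(f"actual={actual:12s} predicted={predicted:12s} count={n}")
```

Off-diagonal entries, where the actual and predicted topics differ, show which topics the clustering failed to separate.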

Figure 12.22. The confusion matrix of the results


Interpretation of the results

The results show that the cars topic has blended heavily with electronics, which is plausibly not far from reality, since the two fields have much in common.

Workflow

clust2_exp5.rmp

Keywords

K-means method
cosine similarity
text clustering
text mining

Operators

k-Means
Map Clustering on Labels
Performance (Classification)
Filter Stopwords (English) [Text Mining Extension]
Process Documents from Files [Text Mining Extension]
Stem (Snowball) [Text Mining Extension]
Tokenize [Text Mining Extension]
Transform Cases [Text Mining Extension]