This process uses the Twenty Newsgroups dataset to show how documents can be clustered.
The input is a subset of the Twenty Newsgroups dataset [UCI MLR].
The dataset was donated to the UCI Machine Learning Repository by Tom Mitchell.
The full dataset contains about 20,000 news articles spread across 20 topics. The subset used here covers only three of those topics: cars, electronics, and everyday politics.
The documents are read topic by topic, converted to lower case, tokenized, and stemmed, and stop words are then filtered out. After this, all that remains is to cluster the documents by their TF-IDF vectors.
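The preprocessing chain described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the process itself: the stop word list is a tiny made-up one, toy_stem is a crude suffix stripper standing in for a real stemmer, the three sample sentences are invented, and the TF-IDF weighting (relative term frequency times the logarithm of the inverse document frequency) is one common variant among several.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "for"}


def toy_stem(token):
    """Crude suffix stripper standing in for a real stemmer (an assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lower-case, tokenize, stem, then drop stop words (same order as the text)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [toy_stem(t) for t in tokens]
    return [s for s in stems if s not in STOP_WORDS]


def tfidf_vectors(documents):
    """Return one sparse {term: weight} dictionary per document."""
    token_lists = [preprocess(d) for d in documents]
    n_docs = len(token_lists)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Invented sample documents, not taken from the dataset.
docs = [
    "The engine of the car is stalling",
    "Replacing the capacitor fixed the circuit",
    "The car engine needs a new circuit",
]
vectors = tfidf_vectors(docs)
```

Terms that occur in few documents receive higher weights than terms shared by many, so in this toy example "stall" outweighs "engine" in the first document. The resulting sparse vectors are what a clustering algorithm would then group.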
The distance between document vectors is measured using cosine similarity. The cluster labels are then mapped to class labels in order to check how well the clusters cover the individual topics.
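Both steps can be sketched in a few lines of Python. The sketch rests on assumptions: the text does not say how cluster labels are mapped to class labels, so majority voting is used here as one plausible choice, and the cluster assignments and topic labels below are made-up toy values rather than results of the process. Cosine similarity is the dot product of two vectors divided by the product of their lengths; a distance can be derived from it as one minus the similarity.

```python
import math
from collections import Counter, defaultdict


def cosine_similarity(u, v):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)


def map_clusters_to_topics(cluster_ids, topics):
    """Label each cluster with its most frequent topic (majority vote, an assumption)."""
    members = defaultdict(Counter)
    for cluster, topic in zip(cluster_ids, topics):
        members[cluster][topic] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in members.items()}


def agreement(cluster_ids, topics):
    """Fraction of documents whose mapped cluster label equals their true topic."""
    mapping = map_clusters_to_topics(cluster_ids, topics)
    hits = sum(mapping[c] == t for c, t in zip(cluster_ids, topics))
    return hits / len(topics)


# Two toy vectors sharing one of their two terms.
u = {"car": 1.0, "engine": 1.0}
v = {"car": 1.0, "circuit": 1.0}
similarity = cosine_similarity(u, v)
distance = 1.0 - similarity

# Made-up cluster assignments and true topics for eight documents.
cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2]
topics = ["cars", "cars", "electronics",
          "electronics", "electronics", "cars",
          "politics", "politics"]
mapping = map_clusters_to_topics(cluster_ids, topics)
score = agreement(cluster_ids, topics)
```

In this toy assignment one electronics document sits in the cars cluster and one cars document sits in the electronics cluster, so the agreement score falls below 1. A per-topic breakdown of such mismatches is what reveals how far two topics overlap.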
The results show that the cars topic is heavily mixed with electronics, which is perhaps not far from reality, since the two fields have a great deal in common.