I am building an online news clustering system in Java using the Lucene and Mahout libraries. I intend to use the vector space model with TF-IDF weights for k-means (or fuzzy/streaming k-means). My plan is:

1. Cluster the initial articles.
2. Assign each new article to the cluster whose centroid is closest, provided it lies within a small distance threshold.
3. Treat the leftover documents that are not associated with any old cluster as new data (new topics). Cluster them separately among themselves and add these temporary cluster centroids to the previous centroids.
4. Less frequently, run the full batch clustering to recluster the entire set of documents.

The problem arises when comparing a new article to a centroid in order to assign it to an old cluster. The dimension of the centroid is the number of distinct words in the initial data, but the dimension of the new article is different. I am following the book Mahout in Action. Is there any approach, or some sort of feature extraction, to handle this?

The following similar questions still remain unanswered:

https://stats.stackexchange.com/questions/41409/bag-of-words-in-an-online-configuration-for-classification-clustering
https://stats.stackexchange.com/questions/123830/vector-space-model-for-online-news-clustering

Thanks in advance.
Uhm, the first link *is* answered, since 2012... – Has QUIT--Anony-Mousse Jun 20 '15 at 16:29
1 Answer
0
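Increase the dimensionality as needed, using 0 for the new components: a word the centroid has never seen has weight 0 in the centroid, and a word absent from the new article has weight 0 in the article.
From a theoretical point of view, consider the vector space as infinite-dimensional.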
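
As a rough illustration of the zero-padding idea, here is a minimal sketch in plain Java. It assumes each document and centroid is stored sparsely as a map from term to weight, so a term missing from a map is implicitly 0 and the two vectors never need the same explicit dimension. The class `SparseVectors` and its methods are hypothetical helpers written for this example, not part of Mahout or Lucene, and cosine distance is only one possible choice of measure.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not a Mahout or Lucene class.
final class SparseVectors {

    // Cosine distance between two sparse term-weight vectors.
    // A term missing from a map is treated as having weight 0,
    // which is the same as padding the shorter vector with zeros.
    static double cosineDistance(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w; // only shared terms contribute
            }
        }
        double na = norm(a);
        double nb = norm(b);
        if (na == 0.0 || nb == 0.0) {
            return 1.0; // convention chosen here for empty vectors
        }
        return 1.0 - dot / (na * nb);
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Made-up weights, for demonstration only.
        Map<String, Double> centroid = new HashMap<>();
        centroid.put("election", 0.8);
        centroid.put("vote", 0.6);

        Map<String, Double> article = new HashMap<>();
        article.put("election", 0.5);
        article.put("referendum", 0.5); // term unseen in the initial data

        System.out.println(cosineDistance(centroid, article));
    }
}
```

With this representation, the unseen term "referendum" simply contributes nothing to the dot product with the old centroid, while still counting toward the length of the article vector.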

Has QUIT--Anony-Mousse