
I ran the K-means clustering algorithm against a set of sequence files. However, the generated result looks like this:

0 belongs to cluster 1.0: []

0 belongs to cluster 1.0: []

0 belongs to cluster 1.0: []

0 belongs to cluster 1.0: []

0 belongs to cluster 1.0: []

0 belongs to cluster 1.0: []

The program I use is borrowed from NewsKMeansClustering.java, an example given in chapter 9 of Mahout-in-Action.

Could anyone tell me why I get this kind of result? Is it caused by a specific parameter setting, or by something else?

The core clustering code in this program is

CanopyDriver.run(vectorsFolder, canopyCentroids, new EuclideanDistanceMeasure(), 250, 120, false, false);

KMeansDriver.run(conf, vectorsFolder, new Path(canopyCentroids, "clusters-0"),
    clusterOutput, new TanimotoDistanceMeasure(), 0.01, 20, true, false);
user873766

2 Answers


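I ran into the same issue using Mahout 0.5. I think the problem is that the normPower parameter is passed to both functions, so normalization is applied at the term-frequency step and then again at the TF-IDF step. Try turning it off in the first call, with code similar to this: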

DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
    outputDir, conf, minSupport, maxNGramSize,
    minLLRValue,
    -1.0f, // no normalization here
    logNormalize, numReducers, chunkSize,
    sequentialAccessOutput, namedVector);

TFIDFConverter.processTfIdf(vectorOutput,
    new Path(outputDir, "tfidf"), conf, chunkSize, minDf,
    maxDFPercent, normPower,
    logNormalize, sequentialAccessOutput, namedVector,
    numReducers);

After that I stopped having problems with empty clusters.
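
For anyone unsure what a norm power does to a vector, here is a minimal plain-Java sketch of p-norm normalization: each component is divided by the vector's p-norm. This is not Mahout's implementation; the class and method names (NormPowerSketch, norm, normalize) are made up for illustration. It only shows that normalizing at the term-frequency step changes the raw counts before any TF-IDF weighting is computed from them.

```java
import java.util.Arrays;

public class NormPowerSketch {

    // p-norm of a dense vector: (sum of |x_i|^p)^(1/p)
    static double norm(double[] v, double p) {
        double sum = 0.0;
        for (double x : v) {
            sum += Math.pow(Math.abs(x), p);
        }
        return Math.pow(sum, 1.0 / p);
    }

    // Returns a copy of v divided by its p-norm (all-zero vectors are returned unchanged).
    static double[] normalize(double[] v, double p) {
        double n = norm(v, p);
        double[] out = v.clone();
        if (n == 0.0) {
            return out;
        }
        for (int i = 0; i < out.length; i++) {
            out[i] = v[i] / n;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] termCounts = {3.0, 4.0};           // hypothetical raw term frequencies
        double[] scaled = normalize(termCounts, 2); // L2 normalization
        System.out.println(Arrays.toString(termCounts) + " -> " + Arrays.toString(scaled));
    }
}
```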

rwaury

I had this problem, and as a newbie I found it very difficult to solve. In my case, I realised that the T1 and T2 values for the canopy clustering were only valid for the Reuters data (and Euclidean norm) provided. I had used my own document data, which seemed to have an inherently different distribution of distances between document vectors. So I did some rudimentary analysis and re-estimated T1 and T2 from my own data. Then things worked. See also my post at...

How to pick the T1 and T2 threshold values for Canopy Clustering?
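
As a rough illustration of that kind of rudimentary analysis, the sketch below computes all pairwise Euclidean distances over a small sample of vectors and reads candidate thresholds off the sorted distances. It is plain Java with no Mahout dependency. The class name, the sample data, and the choice of the 50th percentile for T1 and the 25th for T2 are assumptions made for the example, not a recommendation from the linked post; the only constraint taken from canopy clustering itself is that T1 should be larger than T2.

```java
import java.util.Arrays;

public class CanopyThresholdSketch {

    // Euclidean distance between two dense vectors of equal length.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // All pairwise distances (i < j), sorted ascending.
    static double[] sortedPairwiseDistances(double[][] vectors) {
        int n = vectors.length;
        double[] dists = new double[n * (n - 1) / 2];
        int k = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                dists[k++] = euclidean(vectors[i], vectors[j]);
            }
        }
        Arrays.sort(dists);
        return dists;
    }

    // Nearest-rank percentile of an ascending array, with p in (0, 1].
    static double percentile(double[] sorted, double p) {
        int idx = (int) Math.ceil(p * sorted.length) - 1;
        idx = Math.max(0, Math.min(sorted.length - 1, idx));
        return sorted[idx];
    }

    public static void main(String[] args) {
        // Hypothetical sample; in practice, use a sample of your own document vectors.
        double[][] sample = { {0, 0}, {3, 4}, {6, 8}, {0, 1} };
        double[] dists = sortedPairwiseDistances(sample);
        double t1 = percentile(dists, 0.50); // looser threshold (assumed percentile)
        double t2 = percentile(dists, 0.25); // tighter threshold (assumed percentile)
        System.out.println("T1 = " + t1 + ", T2 = " + t2);
    }
}
```

The resulting values would replace the 250 and 120 passed to CanopyDriver.run in the question. Sampling matters here, because the number of pairs grows quadratically with the number of documents.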

Hope this helps.

rpd