Number of clusters obtained using carrot2 inconsistent on the same data set

Question

I am using Carrot2 to cluster a set of 500 emails with the BisectingKMeans algorithm it provides. On the same data set, when I specify k = 9, only 6 clusters are generated, and when I ask for 8 clusters, 7 are generated. However, when I ask for 10 clusters, all 10 are generated. Can anyone please help me figure out the reason behind this?

score 0 · Accepted Answer · answered Jun 05 '13 at 18:33

I've had a look at the code, and it looks like this behaviour was caused by a bug in the cluster splitting routine. I've committed a fix to the master branch of Carrot2, which makes the number of generated clusters more predictable. You can download the binaries with the fix from the Carrot2 build server.

Stanislaw Osinski

Thank you for your response. I tried compiling my code using the new build but now I am getting this error: class file for org.carrot2.util.attribute.IObjectFactory not found – afs Jun 06 '13 at 06:07
My bad. I found IObjectFactory in attributes-binder-1.2.0.jar. Thank you for the help. In your response you mention that the bug fix makes the number of generated clusters more predictable. Can you please elaborate on what made the number of clusters unpredictable even though we are using k-means? – afs Jun 06 '13 at 06:30
I changed the way the centroids are initialized. Previously, the algorithm would take a number of input documents as the initial centroids. The problem with this approach was that the algorithm often wasn't able to split a cluster, because all the documents would group around only one of those initial centroids. The inability to split large clusters lowered the total number of clusters produced. The [change](https://github.com/carrot2/carrot2/commit/f6ed6f3898258ec5d93e68c583e9ff6f24d8dc9e#L0R423) was to initialize the centroids based on all of the documents in the cluster being split. – Stanislaw Osinski Jun 06 '13 at 06:53
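
To make the difference between the two initialization strategies concrete, here is a minimal, self-contained sketch in plain Java. It is not the Carrot2 code: the class name `CentroidInitSketch`, both method names, the choice of the first k documents as seeds, and the round-robin averaging are all assumptions made for illustration. The comment above only says that the new centroids are based on all documents in the cluster being split, so the exact scheme used in the commit may differ.

```java
import java.util.Arrays;

// Hypothetical illustration only; none of these names come from Carrot2.
public class CentroidInitSketch {

    // Old idea (assumed form): use k of the input documents as the initial
    // centroids. Here we simply take the first k; requires k <= docs.length.
    static double[][] initFromSampleDocuments(double[][] docs, int k) {
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = docs[c].clone();
        }
        return centroids;
    }

    // New idea (assumed form): every document contributes to some centroid.
    // Document i is assigned to partition i % k, and each centroid is the
    // mean of the documents in its partition.
    static double[][] initFromAllDocuments(double[][] docs, int k) {
        int dims = docs[0].length;
        double[][] centroids = new double[k][dims];
        int[] counts = new int[k];
        for (int i = 0; i < docs.length; i++) {
            int c = i % k;
            counts[c]++;
            for (int d = 0; d < dims; d++) {
                centroids[c][d] += docs[i][d];
            }
        }
        for (int c = 0; c < k; c++) {
            if (counts[c] == 0) {
                continue; // fewer documents than partitions; leave a zero vector
            }
            for (int d = 0; d < dims; d++) {
                centroids[c][d] /= counts[c];
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        // Four toy "documents" as 2-dimensional term vectors.
        double[][] docs = { {1, 0}, {0, 1}, {3, 0}, {0, 3} };
        System.out.println(Arrays.deepToString(initFromSampleDocuments(docs, 2)));
        System.out.println(Arrays.deepToString(initFromAllDocuments(docs, 2)));
    }
}
```

With document-based seeds, a poorly chosen seed can attract no documents at all, so the split yields a single non-empty cluster. Seeds averaged over all documents start closer to the bulk of the data, which makes an empty partition less likely.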
