
I have two models trained on the same data. The KMeans model is built like below:

    int numIterations = 20;
    int numClusters = 5;
    int runs = 10;
    double epsilon = 1.0e-6;

    KMeans kmeans = new KMeans();
    kmeans.setEpsilon(epsilon);
    kmeans.setRuns(runs);
    kmeans.setMaxIterations(numIterations);
    kmeans.setK(numClusters);
    KMeansModel model = kmeans.run(trainDataVectorRDD.rdd());

And the StreamingKMeans one like below:

    int numOfDimensions = 3;
    int numClusters = 5;
    StreamingKMeans kmeans = new StreamingKMeans()
            .setK(numClusters)
            .setDecayFactor(1.0)            
            .setRandomCenters(numOfDimensions, 1.0, 0);

    kmeans.trainOn(trainDataVectorRDD);   

The idea with the streaming one is that I read everything off the Kafka queue, train the model, and it will automatically update as new data comes in.

I get two completely different sets of cluster centers from the two models. Where did I go wrong? The regular KMeans one is the correct one. I am only posting 2 out of the 5 cluster centers here. Any help is appreciated, thank you =).

Clusters: KMeans

clusterCenter: [1.41012161E9,20.9157142857143,68.01750871080174]

clusterCenter: [2.20259211E8,0.6811821903787257,36.58268423745944]

Clusters: StreamingKMeans

clusterCenter: [-0.07896129994296074,-1.0194960760532714,-0.4783789312386866]

clusterCenter: [1.3712228467872134,-0.16614353149605163,0.24283231360124224]

Subba Rao
1 Answer


k-means is randomized. If you run it twice, you will likely get two different results. In particular, the clusters may not align (i.e. cluster 1 of one result may not match cluster 1 of the other).

Furthermore, streaming k-means is likely allowed only a single pass over the data, so the results should be expected to resemble batch k-means after just 1 iteration.

Update: Spark's StreamingKMeans `setRandomCenters` chooses the initial centers from an N(0,1) distribution. Depending on your data, this may be a bad idea, and some cluster centers (e.g. those with negative coordinates) will simply remain empty forever. In my opinion this is a really poor initialization method, useless for most applications.
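To illustrate why N(0,1) centers can stay empty when the data lives on a completely different scale (the question's first coordinate is around 1e9), here is a standalone 1-D sketch of the decayed update rule from the StreamingKMeans documentation. This is not Spark code; the class, numbers, and single-dimension simplification are all illustrative:

```java
import java.util.Arrays;

public class DeadCenterDemo {
    // Spark's documented decayed update for one center:
    //   c' = (c * n * a + s) / (n * a + m)
    // n = current center weight, a = decay factor,
    // s = sum of points assigned this batch, m = their count.
    static double update(double c, double n, double a, double s, double m) {
        return (c * n * a + s) / (n * a + m);
    }

    public static void main(String[] args) {
        double[] centers = {0.5, -0.5};            // "random" centers on the N(0,1) scale
        double[] weights = {1.0, 1.0};
        double[] batch = {1.41e9, 1.40e9, 2.2e8};  // data on the question's scale

        double[] sum = new double[2];
        double[] cnt = new double[2];
        for (double x : batch) {
            // nearest-center assignment
            int j = Math.abs(x - centers[0]) <= Math.abs(x - centers[1]) ? 0 : 1;
            sum[j] += x;
            cnt[j] += 1.0;
        }
        for (int j = 0; j < 2; j++) {
            if (cnt[j] > 0) {
                centers[j] = update(centers[j], weights[j], 1.0, sum[j], cnt[j]);
                weights[j] += cnt[j];
            }
        }
        System.out.println(Arrays.toString(centers));
        // Every huge positive point is nearer to 0.5 than to -0.5, so the
        // second center never receives a point and stays at -0.5 forever.
    }
}
```

One center absorbs every point of both real clusters while the other never moves, which matches the nonsensical centers in the question.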

Has QUIT--Anony-Mousse
  • That is true, but the problem here is that the streaming cluster centroids don't even make sense, i.e. they are not in the data set. You could be on to something here; I will run the normal k-means later with a single pass and see if they match. Thanks for answering :) – Subba Rao May 24 '16 at 14:25
  • StreamingKMeans in Spark is worse than I thought. `setRandomCenters` will draw **random Gaussians** from N(0,1) and assumes that this is a good idea for your data. Now some of these centers probably never received a single point! – Has QUIT--Anony-Mousse May 24 '16 at 17:41
  • Setting the normal k-means to 1 iteration and 1 run still gives a sensible answer which is in the range of the data. One workaround I found for this issue: if I run the normal k-means and use its cluster centers as the `setInitialCenters` input for the streaming k-means, then it is correct. – Subba Rao May 25 '16 at 03:00
  • You could also try using the first k objects as initial cluster centers. That would be the usual approach for streaming k-means. – Has QUIT--Anony-Mousse May 25 '16 at 04:39