
I have to use k-means clustering (I am using scikit-learn) on a dataset that looks like this:

[figure: scatter plot of the raw dataset]

But when I apply k-means, it doesn't give me the centroids I expected and classifies points incorrectly. Also, how could I find the points that are not correctly clustered in scikit-learn? Here is the code:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

km = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10)
km.fit(Train_data.values)

# Plot the fitted centroids as red dots
plt.plot(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], 'ro')
plt.show()

Here Train_data is a pandas DataFrame with 2 features and 3500 samples, and the code gives the following:

[figure: the centroids plotted by the code above]

It might have happened because of a bad choice of initial centroids, but what could be the solution?

Hima

2 Answers


First of all, I hope you noticed that the ranges on the X and Y axes are different in the two figures, so the first centroid (sorted by X-value) isn't that bad. The second and third ones come out where they do because of the large number of outliers; each is probably pulled halfway between the two rightmost clusters. Also, the output of k-means depends on the initial choice of centroids, so see whether different runs, or setting the init parameter to 'random', improve the results. Another way to improve things would be to remove all points having fewer than some n neighbors within a radius of distance d. To implement that efficiently you would probably need a k-d tree, or you could just use the DBSCAN implementation provided by sklearn and see if it works better.
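A minimal sketch of that outlier-removal idea, using sklearn's DBSCAN on made-up blob data (the blob centers, eps, and min_samples below are assumptions for illustration, not values from the question):

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Toy 2-D data: three dense blobs plus a few scattered outliers.
rng = np.random.RandomState(0)
blobs = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
                   for c in ([0, 0], [5, 0], [2.5, 4])])
outliers = rng.uniform(low=-2, high=7, size=(10, 2))
X = np.vstack([blobs, outliers])

# DBSCAN labels points that lack enough neighbors within eps as
# noise (label -1); drop those before running k-means.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
X_clean = X[db.labels_ != -1]

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_clean)
print(km.cluster_centers_)
```

With the noise points removed, the k-means centroids are no longer dragged toward the outliers.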

Also, k-means++ is likely to pick outliers as initial cluster centers, as explained here. So you may want to change the init parameter of KMeans to 'random', perform multiple runs, and keep the best centroids.
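For example (with synthetic blob data standing in for Train_data, since the real values aren't available): setting init='random' with a larger n_init makes sklearn restart k-means that many times from random centroids and keep the run with the lowest inertia (sum of squared distances to the nearest centroid):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(1)
# Hypothetical stand-in for Train_data.values: three blobs.
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(80, 2))
               for c in ([0, 0], [6, 0], [3, 5])])

# Random restarts instead of k-means++ seeding; sklearn keeps the
# best of the n_init runs automatically.
km = KMeans(n_clusters=3, init='random', n_init=25, random_state=0)
km.fit(X)
print(km.inertia_)
print(km.cluster_centers_)
```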

Since your data is 2-D, it is easy to check whether points are clustered correctly. Use the mouse to pick the coordinates of the approximate centroids (see here) and compare the clusters obtained from the picked coordinates with those obtained from k-means.
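A non-interactive sketch of that comparison (the "picked" coordinates below are hypothetical, standing in for values you would read off the plot with something like plt.ginput(3)):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(2)
# Toy data standing in for the real 2-D dataset.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(60, 2))
               for c in ([0, 0], [5, 0], [2.5, 4])])
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Hypothetical centroids picked by eye from the scatter plot.
picked = np.array([[0.0, 0.0], [5.0, 0.0], [2.5, 4.0]])

# Assign every point to its nearest picked centroid ...
eye_labels = np.argmin(
    np.linalg.norm(X[:, None, :] - picked[None, :, :], axis=2), axis=1)
# ... map each picked centroid to the k-means cluster it falls in,
# then count points where the two assignments disagree.
match = km.predict(picked)
n_disagree = np.sum(match[eye_labels] != km.labels_)
print(n_disagree, "points assigned differently")
```

Points counted in n_disagree are the ones k-means placed in a different cluster than the eyeballed centroids would suggest.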

rajat
  • Tried everything. Even tried setting the initial centroids explicitly; it gives the same result. – Hima Jan 24 '17 at 16:50
  • If you set the initial centroids explicitly and still get the same answer, then it is due to outliers, and nothing more can be done directly, as k-means is sensitive to outliers. You will have to use other clustering algorithms such as DBSCAN or SLINK (hierarchical), or, if you have to use k-means, modify the data by removing the outliers with the method I suggested in the answer. – rajat Jan 25 '17 at 10:59

I found a solution for this. The problem was scaling. I just scaled both axes using

sklearn.preprocessing.scale

And this is my result:

[figure: clusters and centroids after scaling]

Hima