
If I already have a NumPy array that can serve as the initial centroids, how do I properly initialize the k-means algorithm? I am using the scikit-learn KMeans class.

This post (k-means with selected initial centers) indicates that I only need to set n_init=1 when I use a NumPy array as the initial centroids, but I am not sure whether my initialization is working properly.

Naftali Harris' excellent visualization page shows what I am trying to do: http://www.naftaliharris.com/blog/visualizing-k-means-clustering/

"I'll choose" --> "Packed Circles" --> run kmeans

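import numpy as np
import sklearn.cluster as sk  # alias assumed from the sk.KMeans call below

# NumPy array of initial centroids, shape (n_clusters, n_features) = (8, 4)
startpts = np.array([[-0.12, 0.939, 0.321, 0.011], [0.0, 0.874, -0.486, 0.862], [0.0, 1.0, 0.0, 0.033], [0.12, 0.939, 0.321, -0.7], [0.0, 1.0, 0.0, -0.203], [0.12, 0.939, -0.321, 0.25], [0.0, 0.874, 0.486, -0.575], [-0.12, 0.939, -0.321, 0.961]], np.float64)

# n_init=1: run once from the given centers
centroids = sk.KMeans(n_clusters=8, init=startpts, n_init=1)

# actual_data_points is the (n_samples, 4) data array, defined elsewhere
centroids.fit(actual_data_points)

# get the array of fitted cluster centers
centroids_array = centroids.cluster_centers_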
webmaker

1 Answer


Yes, setting the initial centroids via init should work. Here's a quote from the scikit-learn documentation:

 init : {‘k-means++’, ‘random’ or an ndarray}

     Method for initialization, defaults to ‘k-means++’:   

     If an ndarray is passed, it should be of shape (n_clusters, n_features)
     and gives the initial centers.
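To make the three options in that quote concrete, here is a minimal sketch. The tiny dataset X and the starting centers centers0 are made up for illustration and are not from the question; only the KMeans parameters init, n_init, n_clusters and random_state are taken from scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans

# Tiny stand-in dataset (an assumption): 6 samples, 2 features.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0],
              [5.1, 4.9], [9.0, 0.0], [9.2, 0.1]])

# Explicit starting centers: 3 rows (n_clusters) by 2 columns (n_features).
centers0 = np.array([[0.0, 0.0], [5.0, 5.0], [9.0, 0.0]])

# One model per documented value of init.
models = {
    "k-means++": KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0),
    "random": KMeans(n_clusters=3, init="random", n_init=10, random_state=0),
    "ndarray": KMeans(n_clusters=3, init=centers0, n_init=1),
}

for name, model in models.items():
    model.fit(X)
    print(name, model.cluster_centers_.shape)
```

With an ndarray there is nothing random to restart from, which is why n_init=1 is the natural choice in that case.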

What is the shape (n_clusters, n_features) referring to?

The shape requirement means that init must have exactly n_clusters rows, and the number of elements in each row should match the dimensionality of actual_data_points:

>>> init = np.array([[-0.12, 0.939, 0.321, 0.011],
...                  [0.0, 0.874, -0.486, 0.862],
...                  [0.0, 1.0, 0.0, 0.033],
...                  [0.12, 0.939, 0.321, -0.7],
...                  [0.0, 1.0, 0.0, -0.203],
...                  [0.12, 0.939, -0.321, 0.25],
...                  [0.0, 0.874, 0.486, -0.575],
...                  [-0.12, 0.939, -0.321, 0.961]],
...                 np.float64)
>>> init.shape[0] == 8  # n_clusters
True
>>> init.shape[1] == actual_data_points.shape[1]  # n_features
True
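If you want the same check as a reusable guard before fitting, something along these lines could work. This is only a sketch: check_init_shape is a hypothetical helper (not part of scikit-learn), and demo_data is random stand-in data used in place of actual_data_points.

```python
import numpy as np

def check_init_shape(init, data, n_clusters):
    """Hypothetical helper: raise ValueError unless init has shape
    (n_clusters, n_features), where n_features = data.shape[1]."""
    init = np.asarray(init, dtype=np.float64)
    data = np.asarray(data)
    expected = (n_clusters, data.shape[1])
    if init.shape != expected:
        raise ValueError(f"init has shape {init.shape}, expected {expected}")
    return init

# Stand-in data (an assumption): 100 random samples with 4 features.
rng = np.random.default_rng(0)
demo_data = rng.normal(size=(100, 4))

# Use the first 8 samples as starting centers: 8 rows, 4 columns.
demo_init = demo_data[:8]
checked = check_init_shape(demo_init, demo_data, n_clusters=8)
print(checked.shape)
```

Passing an array with the wrong number of columns, for example demo_data[:8, :3], makes the helper raise ValueError instead of letting the mismatch surface later inside fit.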

What is n_features?

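n_features is the dimensionality of each sample, i.e. the number of columns in actual_data_points. For instance, if you were to cluster points on a 2D plane, n_features would be 2. Each of your initial centroids has 4 coordinates, so in your case n_features is 4.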

Sergei Lebedev
  • So that is where I am confused: what is the shape (n_clusters, n_features) referring to? Is it (n_clusters=8, n_features=startpts), where startpts is the ndarray? – webmaker Jul 13 '16 at 15:23
  • What is n_features? The only examples on the sklearn documentation site use `init='k-means++'`. The library source code doesn't have an example either. – webmaker Jul 13 '16 at 15:32
  • Initializing with a NumPy array doesn't seem to change the way the k-means algorithm runs. I have also run it with `init='k-means++'` and did not see a significant difference. Is there a way to verify? – webmaker Jul 13 '16 at 20:34
  • The most direct way would be to look at the [code](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/k_means_.py#L687), which simply uses `init` as is. Note that k-means is an iterative algorithm and may converge to the same parameter values from different starting points (manual and `'k-means++'`). – Sergei Lebedev Jul 13 '16 at 21:01
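One way to check this empirically, as a rough sketch rather than a definitive test: limit the fit to a single iteration with max_iter=1 and compare the resulting centers with one assignment/update step computed by hand from the same starting points. The data X, the blob centers, and manual_init below are synthetic assumptions chosen so that no cluster ends up empty; they are not the data from the question.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data (an assumption): three well-separated 2D blobs.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

# Hand-picked starting centers, shape (n_clusters, n_features) = (3, 2).
manual_init = np.array([[1.0, 1.0], [4.0, 4.0], [9.0, 1.0]])

# One step by hand: assign each point to its nearest starting center,
# then move each center to the mean of the points assigned to it.
dists = np.linalg.norm(X[:, None, :] - manual_init[None, :, :], axis=2)
labels = dists.argmin(axis=1)
expected = np.array([X[labels == k].mean(axis=0) for k in range(3)])

# The same single step performed by KMeans, started from manual_init.
one_step = KMeans(n_clusters=3, init=manual_init, n_init=1, max_iter=1).fit(X)
init_was_used = np.allclose(one_step.cluster_centers_, expected)
print("first step starts from manual_init:", init_was_used)

def sort_rows(centers):
    """Order centers by their first coordinate so two fits can be compared."""
    return centers[np.argsort(centers[:, 0])]

# Full runs from both initializations, compared after sorting.
manual = KMeans(n_clusters=3, init=manual_init, n_init=1).fit(X)
plus = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
same_result = np.allclose(sort_rows(manual.cluster_centers_),
                          sort_rows(plus.cluster_centers_), atol=1e-6)
print("same final centers:", same_result)
```

If the first comparison holds, the array passed as init really is the starting point. The second comparison illustrates the point in the last comment: on well-separated data, both initializations can end at the same centers, so identical final results do not mean that init was ignored.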