
I've implemented K-Means in Java and have a bit of a head scratcher. I select my initial centroids by choosing a random value in each dimension within the range of values of the data points. I've run into cases where this results in one or more of these centroids not ending up being the closest centroid of any data point. So what do I do for the next iteration? Just leave it at its original randomized value? Pick a new random value? Compute it as an average of the other centroids? It seems like this isn't accounted for in the original algorithm, but probably I've just missed something.

Scott Weinstein
bab

3 Answers


Most implementations of k-means choose the initial centroids from the actual data points, not from random points in the bounding box spanned by the variables. That said, some suggestions for solving your immediate problem are below.

You could take another data-point at random and make it a new cluster centroid. This is very simple and fast to implement, and shouldn't affect the algorithm adversely.
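As a rough sketch of that idea in Java (the class and method names here are hypothetical, not from your code), the fix is just to swap the empty centroid for a copy of a randomly chosen data point before the next iteration:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class EmptyClusterFix {
    // Hypothetical helper: if a centroid ends an iteration with no assigned
    // points, replace it with a randomly chosen data point so the next
    // iteration proceeds with K usable centroids.
    public static double[] replaceEmptyCentroid(List<double[]> data, Random rng) {
        double[] picked = data.get(rng.nextInt(data.size()));
        // Copy the point so later centroid updates don't mutate the data set.
        return Arrays.copyOf(picked, picked.length);
    }
}
```

Because the replacement is an actual data point, it is guaranteed to be the closest centroid to at least itself on the next assignment step.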

You could also try making a smarter initial selection of cluster centroids using kmeans++. This algorithm chooses the first centroid uniformly at random from the data, then picks each of the remaining K-1 centroids from the data points with probability proportional to the squared distance from the nearest centroid chosen so far, which tends to spread the centroids out. By picking smarter centroids, you are much less likely to encounter the problem of a centroid being assigned zero data points.
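The kmeans++ seeding step can be sketched in Java roughly as below (class and method names are my own, not from a library; this is the standard D² weighting, not a tuned implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class KMeansPlusPlus {
    // k-means++ seeding: the first centroid is uniform-random; each later
    // centroid is a data point sampled with probability proportional to its
    // squared distance from the nearest centroid picked so far.
    public static List<double[]> seed(List<double[]> data, int k, Random rng) {
        List<double[]> centroids = new ArrayList<>();
        centroids.add(data.get(rng.nextInt(data.size())).clone());
        while (centroids.size() < k) {
            double[] d2 = new double[data.size()];
            double total = 0;
            for (int i = 0; i < data.size(); i++) {
                d2[i] = nearestSquaredDistance(data.get(i), centroids);
                total += d2[i];
            }
            // Sample an index with probability proportional to d2[i].
            double r = rng.nextDouble() * total;
            int chosen = 0;
            for (int i = 0; i < d2.length; i++) {
                r -= d2[i];
                if (r <= 0) { chosen = i; break; }
            }
            centroids.add(data.get(chosen).clone());
        }
        return centroids;
    }

    private static double nearestSquaredDistance(double[] p, List<double[]> centroids) {
        double best = Double.MAX_VALUE;
        for (double[] c : centroids) {
            double sum = 0;
            for (int j = 0; j < p.length; j++) {
                double diff = p[j] - c[j];
                sum += diff * diff;
            }
            best = Math.min(best, sum);
        }
        return best;
    }
}
```

Note that points already chosen as centroids have squared distance zero, so they are never picked twice.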

If you wanted to be slightly more clever, you could use the kmeans++ seeding step to generate a new centroid whenever a centroid ends up with zero data points.

James Thompson

The way I've used it, the initial values were taken as random points from the data set, not random points in the spanned space. That means each cluster has at least one point in it initially. You could still get unlucky with outliers but with any luck you'll be able to detect this and restart with different points. (Provided "K clusters of points" is an adequate description of your data)

Michael Clerx

Instead of picking random values (which can be pretty meaningless if the space of possible values is large in comparison to the clusters), many implementations pick random points from the dataset as the initial centroids.
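A minimal Java sketch of this initialization (names are illustrative, not from any library): shuffle a copy of the data and take the first K points, so every initial centroid coincides with a real observation and the centroids are distinct as long as the points are.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class RandomPointInit {
    // Pick K distinct data points as initial centroids by shuffling a copy
    // of the data set and taking the first K entries.
    public static List<double[]> initialCentroids(List<double[]> data, int k, Random rng) {
        List<double[]> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, rng);
        List<double[]> centroids = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            // Copy each point so centroid updates don't overwrite the data.
            centroids.add(shuffled.get(i).clone());
        }
        return centroids;
    }
}
```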

phihag