15

I am trying to implement k-means for a homework assignment. My exercise sheet gives me the following remark regarding empty centers:

During the iterations, if any of the cluster centers has no data points associated with it, replace it with a random data point.

That confuses me a bit. Firstly, Wikipedia and the other sources I read do not mention it at all. I also read about the problem of 'choosing a good k for your data', and I wonder: how is my algorithm supposed to converge if I keep setting new centers for clusters that were empty?

If I ignore empty clusters, I converge after 30-40 iterations. Is it wrong to ignore empty clusters?

Bill the Lizard
toobee
  • Here is a minimal reproducible example of an initialization in 1D: `{2, 3, 3, 3} {3, 7, 7, 7} {7, 8, 8, 8}` (k=3). The first update will empty the middle cluster! – Gabriel Devillers Nov 19 '20 at 17:39
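For anyone who wants to verify that comment's example: below is a minimal sketch of a single k-means (Lloyd) iteration on that data, assuming the standard update and assignment steps. The middle cluster does indeed end up empty.

```python
# Data and initial clusters from the comment: {2,3,3,3} {3,7,7,7} {7,8,8,8}, k = 3.
points = [2, 3, 3, 3, 3, 7, 7, 7, 7, 8, 8, 8]
init_clusters = ([2, 3, 3, 3], [3, 7, 7, 7], [7, 8, 8, 8])

# Update step: each center becomes the mean of its cluster.
centers = [sum(c) / len(c) for c in init_clusters]
print(centers)   # [2.75, 6.0, 7.75]

# Assignment step: each point moves to its nearest center.
clusters = [[] for _ in centers]
for p in points:
    nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
    clusters[nearest].append(p)

print(clusters)  # [[2, 3, 3, 3, 3], [], [7, 7, 7, 7, 8, 8, 8]] -- middle cluster is empty
```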

8 Answers

10

Check out this example of how empty clusters can happen: http://www.ceng.metu.edu.tr/~tcan/ceng465_f1314/Schedule/KMeansEmpty.html It basically means either 1) a random tremor in the force, or 2) the number of clusters k is wrong. You should iterate over a few different values of k and pick the best. If you encounter an empty cluster during your iterations, place a random data point into that cluster and carry on. I hope this helped with your homework assignment last year.
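As a rough illustration of that strategy, here is a minimal 1D sketch (the names and structure are mine, not from the linked page) of k-means that re-seeds an empty cluster with a random data point:

```python
import random

def kmeans_1d(points, k, max_iter=100, seed=0):
    """1D k-means sketch: an empty cluster's center is replaced
    by a random data point, as the answer above suggests."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers
    clusters = []
    for _ in range(max_iter):
        # Assignment step: group points by nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[j].append(p)
        # Update step; an empty cluster gets a random data point as center.
        new_centers = [sum(c) / len(c) if c else rng.choice(points)
                       for c in clusters]
        if new_centers == centers:   # converged
            break
        centers = new_centers
    return centers, clusters
```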

offwhitelotus
4

Handling empty clusters is not part of the k-means algorithm itself, but it might result in better cluster quality. As for convergence, it is only heuristically rather than exactly guaranteed, so the stopping criterion is usually extended with a maximum number of iterations.

Regarding the strategy to tackle this problem, I would say that randomly assigning some data point to the empty cluster is not very clever, since it can hurt cluster quality no matter whether the point's distance to its currently assigned center is large or small. A heuristic for this case would be to choose the farthest point from the biggest cluster and move it to the empty cluster, then repeat until there are no empty clusters (a sketch follows below).
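A minimal sketch of that heuristic, assuming 1D points and taking "biggest" to mean the cluster with the most members (the helper name is mine):

```python
def fix_empty_clusters(clusters, centers):
    """For every empty cluster, move the farthest point of the biggest
    cluster (most members) into it. Sketch only -- "biggest" could also
    mean highest SSE or largest spread (see the comments below)."""
    for i in range(len(clusters)):
        if not clusters[i]:
            big = max(range(len(clusters)), key=lambda j: len(clusters[j]))
            # Farthest member of the biggest cluster from its own center.
            far = max(clusters[big], key=lambda p: abs(p - centers[big]))
            clusters[big].remove(far)
            clusters[i].append(far)
            centers[i] = far  # the moved point becomes the new center
    return clusters, centers
```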

Andres Felipe
  • `farthest point from the biggest cluster` "Biggest" in what respect? – ttnphns Jan 20 '17 at 12:48
  • I would interpret it as the biggest in terms of number of elements, but you could also pick the point farthest from its cluster center. – Ketil Nov 08 '17 at 11:07
  • I guess the furthest point from its centroid would be more rigorous. If you have a big cluster with points very close to their centroid, I can see no reason to split it. – Joseph Budin May 23 '18 at 12:24
4

Statement: k-means can lead to empty clusters.

Below is the execution flow of k-means on the given distribution. [Figures omitted: step-by-step cluster assignments on a number line.]

Consider this distribution of data points:

  • Overlapping points mean that the distance between them is del, where del tends to 0; you can assume an arbitrarily small value, e.g. 0.01, for it.

  • A dashed box represents a cluster assignment.

  • The legend in the footer represents the number line.

N = 6 points, k = 3 clusters (coloured). The final result has only 2 clusters: the blue cluster is orphaned and ends up empty.

user1476394
3

Empty clusters can be obtained if no points are allocated to a cluster during the assignment step. If this happens, you need to choose a replacement centroid, otherwise the SSE will be larger than necessary.

  • Choose the point that contributes most to the SSE.
  • Choose a point from the cluster with the highest SSE.
  • If there are several empty clusters, the above can be repeated several times.

SSE = Sum of Squared Errors.

Check this site https://chih-ling-hsu.github.io/2017/09/01/Clustering#
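A rough sketch of the first strategy, assuming 1D data, squared error, and an `assignments` list that maps each point index to its cluster index (all names are my own, not from the linked site):

```python
def replace_empty_center(points, centers, assignments, empty_idx):
    """Replace an empty cluster's center with the point that currently
    contributes most to the SSE, then reassign that point."""
    def sse_contribution(i):
        return (points[i] - centers[assignments[i]]) ** 2

    worst = max(range(len(points)), key=sse_contribution)
    centers[empty_idx] = points[worst]
    assignments[worst] = empty_idx
    return centers, assignments
```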

Punpun
2

You should not ignore empty clusters but replace them. k-means is an algorithm that can only give you local minima, and empty clusters are local minima that you don't want. Your program is going to converge even if you replace an empty center with a random point. Remember that at the beginning of the algorithm you choose the initial k centers randomly; if the algorithm can converge from k random points, why shouldn't it converge from k-1 already-converged centers plus 1 random point? Just a couple more iterations are needed.

Fivesheep
1

"Choosing good k for your data" refers to the problem of choosing the right number of clusters. Since the k-means algorithm works with a predetermined number of cluster centers, their number has to be chosen at first. Choosing the wrong number could make it hard to divide the data points into clusters or the clusters could become small and meaningless.

I can't give you an answer on whether it is a bad idea to ignore empty clusters. If you do, you might end up with a smaller number of clusters than you defined at the beginning. This will confuse people who expect k-means to work in a certain way, but it is not necessarily a bad idea.

If you relocate any empty cluster centers, your algorithm will probably converge anyway, provided that happens only a limited number of times. However, if you have to relocate too often, your algorithm might not terminate.

nhahtdh
Konstantin Schubert
0

For "Choosing good k for your data", Andrew Ng gives the example of a tee shirt manufacturer looking at potential customer measurements and doing k-means to decide if you want to offer S/M/L (k=3) or 2XS/XS/S/M/L/XL/2XL (k=7). Sometimes the decision is driven by the data (k=7 gives empty clusters) and sometimes by business considerations (manufacturing costs are less with only three sizes, or marketing says customers want more choices).

0
  1. Set a variable to track the farthest point and its cluster, based on the distance measure used.
  2. After the allocation step for all the points, check the number of data points in each cluster.
  3. If any count is 0, as is the case in this question, take the biggest cluster obtained and split it further into 2 sub-clusters.
  4. Replace the empty cluster and the selected cluster with these sub-clusters (see the sketch after this list).
  5. The issue should be fixed now. Random assignment would disturb the clustering structure already obtained.
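A rough sketch of steps 3-4, assuming 1D data and a simple two-way split seeded with the biggest cluster's extreme points (the helper and its seeding rule are my own illustration, not part of the answer):

```python
def split_biggest_cluster(clusters, centers, empty_idx):
    """Fill an empty cluster by splitting the biggest cluster in two.
    Assumes the biggest cluster holds at least two distinct values."""
    big = max(range(len(clusters)), key=lambda j: len(clusters[j]))
    lo, hi = min(clusters[big]), max(clusters[big])
    # One 2-means-style assignment pass seeded with the extremes.
    left = [p for p in clusters[big] if abs(p - lo) <= abs(p - hi)]
    right = [p for p in clusters[big] if abs(p - lo) > abs(p - hi)]
    clusters[big], clusters[empty_idx] = left, right
    centers[big] = sum(left) / len(left)
    centers[empty_idx] = sum(right) / len(right)
    return clusters, centers
```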