
I am using a modified Lloyd's algorithm to obtain equal-size clusters in k-means with k=2. Here is the pseudocode:

- Randomly choose 2 points as the initial centroids of the 2 clusters (denoted c1, c2)
- Repeat the steps below until convergence:
    - Sort all points xi in ascending order of ||xi-c1|| - ||xi-c2||, i.e. the difference between the distances to the first and second centroids
    - Put the top 50% of points in cluster 1 and the rest in cluster 2
    - Recalculate the centroids as the average of the allocated points (as usual in Lloyd's)
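The steps above can be sketched as follows, assuming Euclidean distances and an even number of points; the function name and parameters are illustrative, not part of the question:

```python
import numpy as np

def balanced_two_means(X, n_iter=100, seed=0):
    """Balanced 2-means: sort by distance difference, split 50/50."""
    rng = np.random.default_rng(seed)
    # Randomly choose 2 points as initial centroids c1, c2
    c1, c2 = X[rng.choice(len(X), size=2, replace=False)]
    half = len(X) // 2
    for _ in range(n_iter):
        # Sort points by ||xi - c1|| - ||xi - c2||
        diff = np.linalg.norm(X - c1, axis=1) - np.linalg.norm(X - c2, axis=1)
        order = np.argsort(diff)
        in1, in2 = order[:half], order[half:]   # top 50% -> cluster 1
        # Recalculate centroids as the mean of the allocated points
        new_c1, new_c2 = X[in1].mean(axis=0), X[in2].mean(axis=0)
        if np.allclose(new_c1, c1) and np.allclose(new_c2, c2):
            break                               # converged
        c1, c2 = new_c1, new_c2
    labels = np.empty(len(X), dtype=int)
    labels[in1], labels[in2] = 0, 1
    return labels, c1, c2
```

For a fixed pair of centroids, this assignment is exactly the balanced split that minimizes the sum of distances to the assigned centroids, which is why each iteration cannot increase the objective.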

Now the above algorithm is working fine for me empirically:

  1. It gives balanced clusters
  2. It always decreases the objective

Has such an algorithm been proposed or analyzed before in literature? Can I get some references please?

vervenumen

1 Answer


A more general version for more than 2 clusters is explained here:

https://elki-project.github.io/tutorial/same-size_k_means

I have seen k-means with various size constraints several times in the literature, but I don't have any references at hand. I'm not convinced by the approach: forcing clusters to have the same size contradicts the k-means idea of finding the least-squares best approximation, IMHO, since it means deliberately choosing a worse approximation.

Has QUIT--Anony-Mousse
  • Thanks for the reference! In my opinion, there is a crucial difference between my algorithm and the one in the reference: for k=2, the point assignment step can be solved exactly as above, while for the more general k>2 this does not seem to be the case. Hence, in the above link, they use a local point-swapping procedure, which is unnecessary when k=2. I wanted to know if a proof for the k=2 case exists somewhere. – vervenumen May 15 '17 at 08:28
  • I don't think the k=2 case is of much special interest, because one is usually looking for more clusters. I have definitely seen this kind of operation for k=2 in metric indexing. – Has QUIT--Anony-Mousse May 15 '17 at 22:04