
I want to cluster binary vectors (millions of them) into k clusters. I am using Hamming distance to find the nearest neighbors to the initial cluster centers, which is also very slow. I don't think k-means clustering really fits here: the problem is computing the mean of the nearest neighbors (which are binary vectors) of an initial cluster center in order to update the centroid.
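
To make the slow step concrete, here is roughly the kind of scan I mean (a minimal sketch with made-up sizes, using bit-packing so the distance becomes XOR plus a popcount table; even in this form, every vector must be scanned for every center):

    import numpy as np

    # Sketch only: sizes are made up for illustration.
    rng = np.random.default_rng(0)
    n, d = 100_000, 256                      # millions of vectors in reality
    X = rng.integers(0, 2, size=(n, d), dtype=np.uint8)
    packed = np.packbits(X, axis=1)          # n x (d/8) array of bytes

    # popcount lookup table for a single byte
    POPCOUNT = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None], axis=1).sum(1)

    def hamming_to_all(q_packed, packed):
        """Hamming distance from one packed query vector to every packed row."""
        xor = np.bitwise_xor(packed, q_packed)   # differing bits, byte by byte
        return POPCOUNT[xor].sum(axis=1)

    dists = hamming_to_all(packed[0], packed)    # vector 0 to all others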

A second option is k-medoids, in which the new cluster center is chosen from among the nearest neighbors (the one that is closest to all the neighbors of a particular cluster center). But finding that medoid is another problem, because the number of nearest neighbors is also quite large.
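
For example, the medoid step alone looks roughly like this (a sketch with hypothetical sizes, not my actual code), and it is quadratic in the number of members of a cluster:

    import numpy as np

    # Sketch: choosing the medoid of one cluster needs all pairwise Hamming
    # distances among its members, O(m^2 * d) for m members of dimension d.
    def medoid(members):
        # members: m x d array of 0/1 values
        diffs = (members[:, None, :] != members[None, :, :]).sum(axis=2)
        return members[diffs.sum(axis=1).argmin()]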

Can someone please guide me?

NightFox

2 Answers


It is possible to do k-means clustering with binary feature vectors. The TopSig paper I co-authored has the details: the centroids are calculated by taking the most frequently occurring bit in each dimension. The TopSig paper applied this to document clustering, where the binary feature vectors were created by random projection of sparse, high-dimensional bag-of-words feature vectors. There is a Java implementation at http://ktree.sf.net. We are currently working on a C++ version; it is very early code, still messy and probably containing bugs, but you can find it at http://github.com/cmdevries/LMW-tree. If you have any questions, please feel free to contact me at chris@de-vries.id.au.
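
To illustrate the centroid update described above, here is a minimal sketch (illustrative code, not the TopSig or ktree.sf.net implementation) of k-means on binary vectors: assignment by Hamming distance, and each centroid bit set to the most frequent bit of its cluster in that dimension:

    import numpy as np

    def binary_kmeans(X, k, iters=10, seed=0):
        """k-means on an n x d array of 0/1 values: Hamming assignment,
        majority-bit centroid update. Illustrative, not optimized."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        centroids = X[rng.choice(n, size=k, replace=False)].copy()
        for _ in range(iters):
            # Hamming distance = number of differing bits (n x k matrix)
            dists = (X[:, None, :] != centroids[None, :, :]).sum(axis=2)
            labels = dists.argmin(axis=1)
            for j in range(k):
                members = X[labels == j]
                if len(members):
                    # most frequently occurring bit per dimension
                    centroids[j] = (members.mean(axis=0) >= 0.5).astype(X.dtype)
        return centroids, labels

For millions of vectors you would replace the full distance matrix with bit-packed XOR/popcount distances, but the majority-vote update stays the same.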

If you want to cluster a large number of binary vectors, there are also more scalable tree-based clustering algorithms: K-tree, TSVQ, and EM-tree. For more details on these algorithms, see a paper on the EM-tree that I have recently submitted for peer review (not yet published).

Chris de Vries

Indeed, k-means is not very appropriate here, because the mean of binary vectors is generally not binary: averaging (0, 1) and (1, 1), for example, gives (0.5, 1), which is no longer a valid binary vector.

Why do you need exactly k clusters? Fixing k will likely mean that some vectors do not fit their clusters very well.

Some techniques you could look into for clustering binary data: MinHash and locality-sensitive hashing (LSH).
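
For example, here is a minimal sketch (my illustration, with made-up parameters) of bit-sampling LSH for Hamming distance: each table hashes a vector by a small random subset of its bits, so vectors at small Hamming distance tend to collide in at least one table:

    import numpy as np
    from collections import defaultdict

    def build_lsh(X, n_tables=8, bits_per_table=16, seed=0):
        """Index an n x d array of 0/1 values into n_tables hash tables."""
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        samples = [rng.choice(d, size=bits_per_table, replace=False)
                   for _ in range(n_tables)]
        tables = []
        for idx in samples:
            table = defaultdict(list)
            for i, row in enumerate(X):
                table[row[idx].tobytes()].append(i)   # bucket by sampled bits
            tables.append(table)
        return samples, tables

    def candidates(q, samples, tables):
        """Union of q's buckets across all tables: likely near neighbors."""
        out = set()
        for idx, table in zip(samples, tables):
            out.update(table.get(q[idx].tobytes(), []))
        return out

You then verify the candidates with exact Hamming distances, which is far cheaper than scanning all vectors.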

Has QUIT--Anony-Mousse
  • Yeah, I need k clusters, because they will serve as a vocabulary (codebook) for my data points. I will use them for a bag-of-words model. – NightFox Jun 11 '13 at 06:47
  • Well, you could just use however many clusters you get out, and ignore unassigned points, for example. If the clustering algorithm says that 317 clusters is appropriate and 1000 points are noise, why not use 317 words? – Has QUIT--Anony-Mousse Jun 11 '13 at 08:01
  • What I understand of LSH is that it can be used to find the nearest neighbors to some initial cluster centers. I am still confused about how I will pick the cluster centers for the next iteration of the clustering process. – NightFox Jun 11 '13 at 22:24
  • Or do we really need to run further iterations? Can you please elaborate on that? – NightFox Jun 11 '13 at 22:30
  • You can use LSH to accelerate clustering algorithms based on neighborhoods. Probably not k-means but pretty much any more modern algorithm. – Has QUIT--Anony-Mousse Jun 12 '13 at 06:12