
I have a set of documents and want to group the related ones. Currently I'm using Google's news vector file (GoogleNews-vectors-negative300.bin) to get word vectors, and I use the WMD (Word Mover's Distance) algorithm to get the distance between two documents. Now I want to integrate this with K-means clustering. Basically, I want to override the distance calculation function in KMeans. How can I do that? Any suggestions are most welcome. Thanks in advance.

kathir raja

1 Answer


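Although it is possible in theory to implement k-means with other distance measures, it is not advised: the algorithm may no longer converge. k-means updates each centroid to the mean of its cluster, and the mean minimizes the within-cluster cost only for squared Euclidean distance. More detailed discussion can be found e.g. on StackExchange. That's why scikit-learn's KMeans does not offer other distance metrics.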

I'd suggest using e.g. hierarchical clustering, where you can plug in an arbitrary distance function.
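As a rough illustration of that suggestion, here is a minimal sketch using SciPy's hierarchical clustering on a precomputed pairwise distance matrix. The names `doc_distance` and `cluster_documents` are hypothetical, and the Jaccard distance on token sets is only a stand-in so the example runs without the word-vector file. Assuming gensim is used for WMD, a callable such as a loaded model's `wmdistance` method could probably be passed in its place.

```python
from itertools import combinations

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def doc_distance(a, b):
    """Placeholder distance (Jaccard on token sets); a stand-in for WMD."""
    sa, sb = set(a), set(b)
    return 1.0 - len(sa & sb) / len(sa | sb)


def cluster_documents(docs, distance, n_clusters):
    """Group tokenized documents using any pairwise distance callable."""
    n = len(docs)
    dist = np.zeros((n, n))
    # Fill the symmetric pairwise distance matrix once per pair.
    for i, j in combinations(range(n), 2):
        dist[i, j] = dist[j, i] = distance(docs[i], docs[j])
    # linkage() expects the condensed (upper-triangle) form.
    condensed = squareform(dist, checks=False)
    # Average linkage works with arbitrary, non-Euclidean distances.
    tree = linkage(condensed, method="average")
    # Cut the tree so that at most n_clusters flat clusters remain.
    return fcluster(tree, t=n_clusters, criterion="maxclust")


docs = [
    "stocks fell as markets closed".split(),
    "markets closed and stocks fell sharply".split(),
    "the team won the football match".split(),
    "football team won the final match".split(),
]
labels = cluster_documents(docs, doc_distance, n_clusters=2)
print(labels)
```

Computing the full matrix costs one distance call per document pair, which can be slow for WMD on large collections, so this is likely practical only for modest numbers of documents.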

Lukasz Tracewski