1

I have a CSV with the following dataset:

similarity  | doc_id1   | doc_id2
1           |    34     |     0
1           |    29     |     6
0.997801748 |    22     |    10
0.966014701 |    35     |    16
0.964811948 |    14     |    13

Here, "similarity" is the tf-idf cosine similarity between two documents, and doc_id1 and doc_id2 identify the documents being compared. The closer the similarity is to 1, the more similar the two documents are.

I want to cluster the documents based on this information, but I'm not entirely sure how to do so. I've been reading a lot about spherical K-means clustering, but in terms of implementing it I'm having a hard time wrapping my head around it. Is there a library that might be useful? Is K-means the right way to go at all?

EDIT: This CSV is all I have, so even though I wish I had word frequency based vectors, I don't. If K-means won't work given that all I have are similarities, are there other algorithms that would suit this data?

coolbeans

3 Answers

1

I believe your problem is that you have pairwise similarities, whereas K-Means uses Euclidean distances from centroids. This means that you need a vector for each document, and in your case these vectors will be quite long. Instead of the precomputed similarities, use one dimension per word, with each document's score for that word (e.g. its tf-idf weight) as its coordinate in that dimension. With these vectors you can use sklearn.cluster.KMeans, as suggested by Sam B.
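For illustration only, here is a minimal sketch of that approach, assuming you did have the raw text. The four documents below are made up, and n_clusters=2 is an arbitrary choice. TfidfVectorizer L2-normalizes each row by default, so Euclidean K-Means on these vectors behaves roughly like clustering by cosine similarity.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents; the asker's real texts are not available.
docs = [
    "the cat sat on the mat",
    "a cat and a kitten on a mat",
    "stock prices fell sharply today",
    "investors sold stock as prices fell",
]

# One dimension per word; each row is a unit-length tf-idf vector.
X = TfidfVectorizer().fit_transform(docs)

# n_clusters=2 is an assumption for this toy example.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)
```

The two cat documents should end up with one label and the two stock documents with the other. None of this is possible from the similarity CSV alone, since the word dimensions cannot be recovered from it.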

vagoston
  • Thanks for the clarification, but unfortunately I can't get the dimensions for words (this is the only information I have). If this is the case, I'm guessing KMeans isn't the way to go? And if so, is there a clustering algorithm that would better suit the data? – coolbeans Aug 16 '17 at 21:02
  • Based on pairwise distances, unfortunately, you won't be able to run K-Means. Check this for options: https://stackoverflow.com/questions/18909096/clustering-given-pairwise-distances-with-unknown-cluster-number – vagoston Aug 17 '17 at 14:10
0

Yes, if you are using Python you should check out the scikit-learn package, specifically the sklearn.cluster.KMeans class:

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

Sam B
0

K-means cannot use a distance matrix. It does not use pairwise distances; it uses only point-to-center distances, and the means move in each iteration, so these distances cannot be precomputed.
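To make this concrete, here is a rough sketch of a single k-means iteration in NumPy. The function name lloyd_step and the toy points are made up for illustration. Note that both steps need the coordinates themselves, not just the distances between points.

```python
import numpy as np

def lloyd_step(X, centers):
    """One k-means iteration: assign each point, then recompute the means."""
    # Point-to-center distances; these change whenever the centers move.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recomputing a mean requires the point coordinates,
    # which a list of pairwise similarities does not provide.
    new_centers = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
        for k in range(len(centers))
    ])
    return labels, new_centers

# Toy 2-D points and starting centers (hypothetical values).
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
labels, centers = lloyd_step(X, centers)
print(labels, centers)
```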

You can try, e.g., hierarchical clustering instead. You could also try DBSCAN, OPTICS, etc., but these likely won't give good results on a text collection (then again, neither k-means nor hierarchical clustering is likely to work well either).
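As a minimal sketch of the hierarchical option, assuming SciPy is available: turn each similarity s into a distance 1 - s, fill a symmetric matrix, and cut the resulting tree at some threshold. The pairs below are made-up values, not the asker's data. Average linkage, the 0.5 cut, and treating unlisted pairs as distance 1 are all assumptions you would need to tune. With a real CSV, the rows could be read with the csv module and the doc ids mapped to 0..n-1 first.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# (similarity, doc_id1, doc_id2) rows; hypothetical values.
pairs = [
    (0.95, 0, 1), (0.90, 1, 2), (0.92, 0, 2),
    (0.10, 0, 3), (0.15, 1, 3), (0.05, 2, 3),
    (0.88, 3, 4), (0.12, 0, 4), (0.08, 1, 4), (0.11, 2, 4),
]
n_docs = 5

# Assumption: any pair missing from the list is maximally distant (1.0).
dist = np.ones((n_docs, n_docs))
np.fill_diagonal(dist, 0.0)
for sim, i, j in pairs:
    dist[i, j] = dist[j, i] = 1.0 - sim  # cosine similarity -> distance

# linkage() expects the condensed (upper-triangle) form of the matrix.
Z = linkage(squareform(dist), method="average")

# Cut the dendrogram where merges would exceed distance 0.5.
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)
```

In this toy data, documents 0, 1, and 2 should land in one cluster and documents 3 and 4 in another. The number of clusters falls out of the threshold rather than being fixed in advance.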

Has QUIT--Anony-Mousse