I have a CSV with the following dataset:
similarity | doc_id1 | doc_id2
1 | 34 | 0
1 | 29 | 6
0.997801748 | 22 | 10
0.966014701 | 35 | 16
0.964811948 | 14 | 13
Here, "similarity" is the tf-idf cosine similarity between the two documents identified by doc_id1 and doc_id2. The closer the value is to 1, the more similar the two documents are.
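To make the shape of the data concrete, here is a minimal sketch of how the pairs could be read into a symmetric similarity matrix. It rests on several assumptions that are not confirmed above: that the file is actually comma-delimited with the header shown, that doc ids are 0-based integers, and that any pair missing from the file can be treated as similarity 0. `SAMPLE` and `load_similarity_matrix` are names made up for this illustration.

```python
import csv
import io

# Sample rows copied from the table above, written comma-delimited (assumption).
# For the real file, pass an open file handle instead of io.StringIO(SAMPLE).
SAMPLE = """similarity,doc_id1,doc_id2
1,34,0
1,29,6
0.997801748,22,10
0.966014701,35,16
0.964811948,14,13
"""


def load_similarity_matrix(handle):
    """Build a symmetric similarity matrix from (similarity, doc_id1, doc_id2) rows."""
    rows = [
        (float(r["similarity"]), int(r["doc_id1"]), int(r["doc_id2"]))
        for r in csv.DictReader(handle)
    ]
    # Assumption: ids are 0-based integers, so the largest id fixes the matrix size.
    n = max(max(a, b) for _, a, b in rows) + 1
    # Assumption: pairs that are not listed get similarity 0; the diagonal is 1
    # because a document is identical to itself.
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for s, a, b in rows:
        # Cosine similarity is symmetric, so fill both triangles.
        sim[a][b] = s
        sim[b][a] = s
    return sim


sim = load_similarity_matrix(io.StringIO(SAMPLE))

# Cosine distance (1 - similarity), the form that algorithms accepting a
# precomputed distance matrix typically expect.
dist = [[1.0 - s for s in row] for row in sim]
```

So what I effectively have is a pairwise similarity (or distance) matrix rather than one feature vector per document.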
I want to cluster the documents based on this information, but I'm not sure how. I've been reading a lot about spherical K-means clustering, but I'm having a hard time working out how to implement it. Is there a library that would help? Is K-means even the right approach here?
EDIT: This CSV is all I have. I wish I had word-frequency-based vectors, but I don't. If K-means won't work when all I have are pairwise similarities, are there other algorithms that would suit this data?