I want to cluster documents based on similarity.
I have tried ssdeep (similarity hashing), which is very fast, but I was told that k-means is faster, and that Flann is the fastest of all implementations and more accurate. So I am trying Flann with its Python bindings, but I can't find any example of how to use it on text (it only supports arrays of numbers).
I am very new to this field (k-means, natural language processing). What I need is speed and accuracy.
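To make the "arrays of numbers" problem concrete, here is a minimal sketch of the kind of conversion I assume is needed before any numeric clustering library can be used. The `bag_of_words` helper and the sample documents are hypothetical, written only for illustration; plain word counts are my own guess at a representation, not something Flann or k-means prescribes.

```python
from collections import Counter


def bag_of_words(docs):
    """Turn a list of strings into equal-length lists of word counts.

    Hypothetical helper: whitespace tokenization and raw counts are
    assumptions, chosen only to show text becoming numeric vectors.
    """
    # Lowercase and split each document on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is every distinct word, in a fixed (sorted) order.
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One number per vocabulary word: how often it occurs in this document.
        vectors.append([float(counts[word]) for word in vocab])
    return vocab, vectors


# Made-up sample documents, for illustration only.
docs = ["the cat sat", "the cat ran", "stocks fell today"]
vocab, vectors = bag_of_words(docs)

for doc, vec in zip(docs, vectors):
    print(doc, "->", vec)
```

Each document becomes a fixed-length list of floats, which is the shape of input a numeric library expects. Whether this representation is good enough for accurate clustering is part of what I am asking.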
My questions are:
- Can we do document similarity grouping / clustering using k-means? (Flann does not seem to allow any text input.)
- Is Flann the right choice? If not, please suggest a high-performance library that supports text/document clustering and has a Python wrapper/API.
- Is k-means the right algorithm?