
I am trying to cluster strings using K-means/EM. I have a list of about 70 strings and I want to cluster them using the Levenshtein similarity metric.

So basically, I am trying to implement the clustering part of this research paper (https://ieeexplore.ieee.org/document/7765062/). After preprocessing, I was able to build the similarity matrix using Levenshtein distance, and I then clustered the strings using Hierarchical Clustering as well as Spectral Clustering, but I am unable to do it using K-means or EM. This is because for the prior two algorithms, which I was able to implement, a similarity/distance matrix alone is sufficient for clustering. But for K-means and EM, I need to somehow represent the text in a mathematically operable form, since we have to compute the mean of the points (in the case of K-means).
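
For reference, a minimal sketch of how a Levenshtein distance matrix like the one described above can be built in plain Python. The helper names `levenshtein` and `distance_matrix` and the sample strings are illustrative assumptions, not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def distance_matrix(strings):
    """Symmetric matrix of pairwise Levenshtein distances (hypothetical helper)."""
    n = len(strings)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = levenshtein(strings[i], strings[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Placeholder data; the real input would be the ~70 preprocessed strings.
strings = ["kitten", "sitting", "mitten", "fitting"]
D = distance_matrix(strings)
```

A matrix like `D` is all that hierarchical or spectral clustering needs, but it gives no way to compute the "mean" of a set of strings, which is where K-means gets stuck.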

I was able to find a few techniques for converting text into vectors, such as: 1) Bag of Words, 2) TF-IDF, 3) doc2vec or word2vec.

Should I convert each string into a vector using one of the above methods and then apply K-means? Also, is it necessary to convert the strings to vectors in order to apply K-means or EM? Lastly, I have to implement everything in Python, and KMeans from sklearn doesn't allow me to supply a metric of my choice or a similarity matrix. What should I do?

Note: I found an implementation of K-means on text where the text was first converted using TF-IDF and then clustered with K-means (Euclidean), but I want to use Levenshtein.
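
The TF-IDF approach mentioned in the note looks roughly like the sketch below, assuming scikit-learn is installed. The sample strings, the choice of `n_clusters=2`, and the fixed `random_state` are placeholders for illustration, not values from the implementation referred to above.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder data; the real input would be the list of ~70 strings.
strings = [
    "red apple fruit",
    "green apple fruit",
    "fast sports car",
    "fast racing car",
]

# Turn each string into a sparse TF-IDF vector (word-level tokens by default).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(strings)

# KMeans works on these vectors with Euclidean distance only;
# it has no parameter for a custom metric or a precomputed matrix.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

for text, label in zip(strings, labels):
    print(label, text)
```

This clusters by shared vocabulary in vector space, so it does not use Levenshtein distance at any point.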

Also note: I have a list of strings, not text documents; each string is around 20-30 words.

Kushagra Bhatia
I know it's very old, but were you able to solve your issue? I'm running into the exact same problem (using Levenshtein distance in K-means). In my case the strings are 1 to 5 words and do not have any meaning, since they are names. I still want similar-sounding names to be grouped together, though, hence Levenshtein. – Chapo Apr 24 '23 at 07:12

0 Answers