I am trying to cluster strings using K-means/EM. I have a list of about 70 strings, and I want to cluster them using the Levenshtein distance as the similarity metric.
So basically, I am trying to implement the clustering part of this research paper: https://ieeexplore.ieee.org/document/7765062/. After preprocessing, I was able to formulate the similarity matrix using Levenshtein distance, and I then clustered the strings using both Hierarchical Clustering and Spectral Clustering, but I am unable to do it with K-means or EM. This is because for the two algorithms I did manage to implement, a similarity/distance matrix alone is sufficient for clustering. In the case of K-means and EM, however, I need to somehow represent the text in a mathematically operable form, since we have to compute cluster means (in the case of K-means).
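For reference, this is roughly what I have working so far (a minimal sketch: the strings and the exp(-d) distance-to-similarity conversion are toy stand-ins I chose for illustration, assuming scikit-learn and NumPy are installed):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Toy stand-ins for my ~70 strings.
strings = ["apple pie recipe", "apple tart recipe",
           "quantum physics lecture", "quantum mechanics lecture"]

# Pairwise distance matrix, then convert distances to similarities in (0, 1].
dist = np.array([[levenshtein(s, t) for t in strings] for s in strings])
sim = np.exp(-dist / dist.max())

# Spectral clustering accepts the precomputed affinity matrix directly.
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(sim)
print(labels)
```

This works because Spectral Clustering only ever looks at the affinity matrix; it never needs a mean of the strings themselves.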
I was able to find a few techniques for converting the text into a vector:
1) Bag of Words
2) TF-IDF
3) doc2vec or word2vec
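For example, the Bag of Words option would look something like this (a sketch with made-up toy strings, assuming scikit-learn is installed):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for my list of strings.
strings = ["the quick brown fox", "the lazy dog sleeps", "a quick brown dog"]

# Each string becomes a row vector of word counts over the shared vocabulary.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(strings)  # sparse matrix, shape (n_strings, n_words)
print(X.shape)
```

Once the strings are rows of a numeric matrix like `X`, computing a cluster mean is well-defined, which is what K-means needs.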
Should I convert each string into a vector using one of the above methods and then apply K-means? Is it even necessary to convert the strings to vectors in order to apply K-means or EM? And lastly, I have to implement everything in Python, and sklearn's KMeans doesn't allow me to supply a metric of my choice or a similarity matrix. What should I do?
Note: I had found an implementation of K-means on text where the strings were converted using TF-IDF and K-means (Euclidean) was then applied, but I want to use Levenshtein.
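That implementation was roughly along these lines (my own sketch of the idea, with toy strings; assuming scikit-learn is installed):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy stand-ins for my list of strings.
strings = ["apple banana fruit salad", "banana apple fruit smoothie",
           "car engine repair shop", "engine car repair garage"]

# TF-IDF turns each string into a weighted term vector.
X = TfidfVectorizer().fit_transform(strings)

# K-means then clusters the vectors with the Euclidean metric.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```

The catch, as I understand it, is that K-means here is implicitly tied to Euclidean distance in the TF-IDF space, so I don't see where a Levenshtein metric could be plugged in.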
Also note: I have a list of short strings, not text documents; each string is around 20-30 words.