
I have the following problem at hand: I have a very long list of words, possibly names, surnames, etc. I need to cluster this word list such that similar words, for example words with a small edit (Levenshtein) distance, appear in the same cluster. For example, "algorithm" and "alogrithm" should have a high chance of appearing in the same cluster.

I am well aware of classical unsupervised clustering methods like k-means and EM clustering from the pattern recognition literature. The problem is that these methods work on points that reside in a vector space, whereas what I have here are strings. The question of how to represent strings in a numerical vector space and how to calculate the "mean" of a string cluster does not seem to be sufficiently answered, according to my survey efforts so far. A naive approach would be to combine k-means clustering with the Levenshtein distance, but the question still remains: how do you represent the "mean" of a set of strings? There is the TF-IDF weight, but it seems to be mostly related to the clustering of "text documents", not of single words. It seems that some special-purpose string clustering algorithms do exist, like the one at http://pike.psu.edu/cleandb06/papers/CameraReady_120.pdf

My search in this area is still ongoing, but I wanted to get ideas from here as well. What would you recommend in this case? Is anyone aware of any methods for this kind of problem?

Ufuk Can Bicici

2 Answers


Don't look for clustering. This is misleading. Most algorithms will (more or less forcefully) break your data into a predefined number of groups, no matter what. That k-means isn't the right type of algorithm for your problem should be rather obvious, shouldn't it?

What you describe sounds very similar to near-duplicate detection; the difference is the scale. A clustering algorithm will produce "macro" clusters, e.g. divide your data set into 10 clusters. What you probably want is for much of your data not to be clustered at all; instead, you want to merge near-duplicate strings, which may stem from errors, right?

Levenshtein distance with a threshold is probably what you need. You can try to accelerate this with hashing techniques, for example (see the sketch below).
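For illustration only (my sketch, not part of the original answer): a brute-force thresholded merge in Python could look like this. The threshold of 2 edits and the greedy grouping strategy are assumptions, and the O(n²) pairwise loop is only viable for small lists:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (two rows of the table)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def merge_near_duplicates(words, threshold=2):
    """Greedily put each word into the first group whose representative
    is within `threshold` edits; otherwise start a new group."""
    groups = []  # (representative, members) pairs
    for w in words:
        for rep, members in groups:
            if levenshtein(w, rep) <= threshold:
                members.append(w)
                break
        else:
            groups.append((w, [w]))
    return groups

print(merge_near_duplicates(["algorithm", "alogrithm", "banana"]))
# [('algorithm', ['algorithm', 'alogrithm']), ('banana', ['banana'])]
```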

Similarly, TF-IDF is the wrong tool. It is used for clustering texts, not strings. TF-IDF is the weight assigned to a single word (a string, one that is assumed to contain no spelling errors!) within a larger document. It doesn't work well on short documents, and it won't work at all on single-word strings.
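One way to see this: with the standard weighting tfidf(t, d) = tf(t, d) · log(N / df(t)), a "document" consisting of a single word has tf = 1 for its only term, so the weight says nothing about the word's spelling; two misspellings of the same name are simply treated as two unrelated terms.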

Has QUIT--Anony-Mousse
  • Thanks for the reply! I am aware of what you are saying about TF-IDF. But about the clustering: what if I set the K parameter of the clustering high? I thought about a simple Levenshtein+threshold method as well, but then it is not clear how to start generating clusters. I will get a list of words on the scale of millions, so any O(N^2) pairwise comparison method simply would not work. Can you elaborate on this hashing technique more? How should I begin to design a non-clustering method? – Ufuk Can Bicici Nov 07 '14 at 15:58
  • And by the way, I need to do this for the following reason: this step is needed as a "performance improver" for another step of a larger algorithm, which requires that strings with a high probability of belonging to the same entity be clustered together. – Ufuk Can Bicici Nov 07 '14 at 16:21
  • Sorry, I don't have links for hashing ready. Essentially, you need to construct hash functions so that misspellings collide, e.g. by not taking letter order into account when hashing, or by randomly skipping letters. Look up **minhash** (a rough sketch follows this comment thread). K-means really won't work for you, sorry. It's least-squares minimization; it only makes sense for *continuous* numerical data. – Has QUIT--Anony-Mousse Nov 07 '14 at 17:31
  • Thanks for your help; I am going to investigate that hashing method. Finally, what about the k-medoids method? It doesn't use Euclidean distance and does not need to calculate the mean of the cluster; it takes the data point with the smallest total distance to all other points in the cluster as the cluster center. Do you have any knowledge about this method? – Ufuk Can Bicici Nov 07 '14 at 18:29
  • k-medoids will still forcefully squeeze your data into k groups, not detect misspellings and variations. You can try k-medoids with Levenshtein distance, but A) it will scale really badly, and B) the results will be all but convincing (maybe try with some 50 examples first and see if the results match your desired outcome). – Has QUIT--Anony-Mousse Nov 08 '14 at 12:51
  • I see. I have been reading about minhashing and locality-sensitive hashing as you said. I think I have understood them somewhat, but one question still remains in my mind: how should we start building clusters when using methods like LSH? Do you have any good resources to recommend on this technique? In my googling I failed to find satisfactory explanations or tutorials. – Ufuk Can Bicici Nov 08 '14 at 15:31
  • Forget about clustering. You want to find hash collisions, not clusters. You may also want to check how **openrefine** does this. – Has QUIT--Anony-Mousse Nov 08 '14 at 16:30
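Following up on the minhash pointer above, here is a toy sketch (my own, not from the comments) of bucketing words by MinHash signatures over character bigrams so that likely misspellings collide. The bigram shingling, the 8 hash functions in bands of 2, and md5-with-a-seed-prefix as a stand-in hash family are all illustrative assumptions; each bucket only yields *candidate* pairs, which would still be verified with the real edit distance:

```python
import hashlib
from collections import defaultdict

def bigrams(word):
    w = f"#{word}#"  # pad the ends so boundary letters still contribute
    return {w[i:i + 2] for i in range(len(w) - 1)}

def minhash_signature(shingles, num_hashes=8):
    """One min-over-hashes value per seeded hash function."""
    return tuple(
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles)
        for seed in range(num_hashes))

def candidate_buckets(words, num_hashes=8, band_size=2):
    """LSH banding: words that agree on any band of their signature
    land in the same bucket and become candidate near-duplicates."""
    buckets = defaultdict(set)
    for w in words:
        sig = minhash_signature(bigrams(w), num_hashes)
        for start in range(0, num_hashes, band_size):
            buckets[(start, sig[start:start + band_size])].add(w)
    return [ws for ws in buckets.values() if len(ws) > 1]

print(candidate_buckets(["algorithm", "alogrithm", "banana"]))
```

Only the candidate pairs inside a bucket need an exact Levenshtein check, which avoids the O(N²) all-pairs comparison.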

I have encountered the same kind of problem. My approach was to create a graph where each string is a node and each edge connects two nodes, weighted by the similarity of those two strings. You can use edit distance or the Sørensen–Dice coefficient for that. I also set a threshold of 0.2, so that my graph is not complete and therefore not too heavy computationally. After forming the graph, you can use community detection algorithms to detect communities of nodes. Each community is formed by nodes that have many edges with each other, so its members will be very similar to one another. You can use networkx or igraph to form the graph and identify each community; each community will then be a cluster of strings. I tested this approach with some strings that I wanted to cluster. Here are some of the identified clusters.

[Screenshots of the University, Council, and Committee clusters omitted.]

I visualised the graph with the Gephi tool. Hope that helps, even if it is quite late.
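In case it helps, here is a minimal sketch of this graph-plus-communities approach (my illustration; the example strings, difflib's ratio as the similarity measure, and the greedy modularity algorithm are assumptions, not the original setup):

```python
import difflib
from itertools import combinations

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

words = ["University of Oslo", "Univerity of Oslo",
         "City Council", "City Counsil",
         "Budget Committee", "Budget Comittee"]

G = nx.Graph()
G.add_nodes_from(words)
for a, b in combinations(words, 2):
    sim = difflib.SequenceMatcher(None, a, b).ratio()  # similarity in [0, 1]
    if sim > 0.2:  # drop weak edges so the graph is not complete
        G.add_edge(a, b, weight=sim)

# Each detected community is one cluster of similar strings.
for cluster in greedy_modularity_communities(G, weight="weight"):
    print(sorted(cluster))
```

Near-duplicate pairs get by far the heaviest edges, so the community detection should separate the words into the University, Council, and Committee groups.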

Anoroah