
I want to divide my word list into some number of clusters using Levenshtein Distance.

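import pandas as pd
from collections import defaultdict
from Levenshtein import distance  # python-Levenshtein package

data = pd.read_csv("data.csv")
Target_Column = data["words"]
Target = Target_Column.tolist()
clusters = defaultdict(list)
threshold = 5
numb = range(len(Target))

# Compare every pair of words once; record each as a neighbour of the other.
for i in numb:
    for j in range(i+1, len(numb)):
        if distance(Target[i], Target[j]) <= threshold:
            clusters[i].append(Target[j])
            clusters[j].append(Target[i])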

But because I loop over the whole list, some clusters are repeated: the same group of words shows up under more than one key. How can I fix this?

Ajay Jadhav

1 Answer


If you only have strings, why not use a set?

Target = set(Target_Column.tolist())

You can also use a default value of a set for your mapping:

clusters = defaultdict(set)

But this requires changing list.append to set.add in your loop.
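As a rough sketch of that change, the loop below collects neighbours in sets instead of lists. To keep it runnable without extra packages it uses a small pure-Python `distance` as a stand-in for `Levenshtein.distance`; the sample `Target` list and the threshold of 2 are made up for illustration, and the mapping is keyed by word rather than by index.

```python
from collections import defaultdict

def distance(a, b):
    """Dynamic-programming edit distance (stand-in for Levenshtein.distance)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Hypothetical sample data; note the duplicated "Ajay".
Target = ["Ajay", "Tree", "Man", "Tiger", "Ajad", "Trend", "Ajay"]
threshold = 2

words = sorted(set(Target))      # drop duplicate words up front
clusters = defaultdict(set)      # word -> set of neighbouring words
for i, w1 in enumerate(words):
    for w2 in words[i + 1:]:
        if distance(w1, w2) <= threshold:
            clusters[w1].add(w2)  # set.add instead of list.append
            clusters[w2].add(w1)
```

Words with no neighbour within the threshold (here "Man" and "Tiger") simply get no entry in `clusters`.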


There is, however, a more pythonic alternative to your code.

I would probably generate a mapping from words to the set of their connections on the fly.

Here is an example assuming words is a set of all words:

clusters = {w1: set(w2 for w2 in words if distance(w1, w2) <= threshold) for w1 in words}

Live example:

>>> distance = lambda x, y: abs(len(x) - len(y))
>>> words = set("abc def abcd abcdefghijk abcdefghijklmnopqrstuv".split())
>>> threshold = 3
>>> clusters = {w1: set(w2 for w2 in words if distance(w1, w2) <= threshold) for w1 in words}
>>> for cluster, values in clusters.items():
...     print cluster, ": ", ", ".join(values)
...
abcd :  abcd, abc, def
abc :  abcd, abc, def
abcdefghijk :  abcdefghijk
abcdefghijklmnopqrstuv :  abcdefghijklmnopqrstuv
def :  abcd, abc, def

Increasing the threshold, we get one big "cluster" containing all words:

>>> threshold = 100
>>> clusters = {w1: set(w2 for w2 in words if distance(w1, w2) <= threshold) for w1 in words}
>>> for cluster, values in clusters.items():
...     print cluster, ": ", ", ".join(values)
...
abcd :  abcd, abc, abcdefghijk, abcdefghijklmnopqrstuv, def
abc :  abcd, abc, abcdefghijk, abcdefghijklmnopqrstuv, def
abcdefghijk :  abcd, abc, abcdefghijk, abcdefghijklmnopqrstuv, def
abcdefghijklmnopqrstuv :  abcd, abc, abcdefghijk, abcdefghijklmnopqrstuv, def
def :  abcd, abc, abcdefghijk, abcdefghijklmnopqrstuv, def
Reut Sharabani
  • Suppose I have a list with 6 words: Target = ('Ajay','Tree','Man','Tiger','Ajad','Trend'). The loop starts from the 1st word and creates cluster 1 with 'Ajay','Ajad', but when it reaches index 4 it creates another cluster containing 'Ajad','Ajay'. One option is to delete a word after adding it to a cluster, but that gives me an "index out of range" error. – Ajay Jadhav May 24 '16 at 06:44
  • @AjayJadhav I've added an example. – Reut Sharabani May 24 '16 at 06:59
  • **I am taking distance from Levenshtein**
    from Levenshtein import distance
    – Ajay Jadhav May 24 '16 at 07:08
  • How is that relevant? Simply replace the distance function with whatever distance function you like over your elements. – Reut Sharabani May 24 '16 at 07:11
  • @Reut Sharabani Let me clarify: I have one list containing 1000 words, and I just want to split it into smaller lists (clusters). The answer should look like cluster_1 = (tree,trend,trim), Cluster_2 = (App,Apple,Pinapple). – Ajay Jadhav May 24 '16 at 07:20
  • The problem is that one word is repeated in two or more clusters. – Ajay Jadhav May 24 '16 at 07:29
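
The last two comments ask for clusters in which every word appears exactly once. One possible way to get that, which is not what the answer above does, is to treat "within the threshold" as a link between two words and take the connected components of the resulting graph (single-linkage grouping). The sketch below does this with a small union-find; `disjoint_clusters` is a hypothetical helper name, the pure-Python `distance` is a stand-in for `Levenshtein.distance`, and the threshold of 2 is an arbitrary choice for the six-word example from the comments.

```python
def distance(a, b):
    """Dynamic-programming edit distance (stand-in for Levenshtein.distance)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def disjoint_clusters(words, threshold):
    """Group words so that each word lands in exactly one cluster.

    Two words share a cluster when a chain of words connects them and
    every step of the chain is within `threshold` edits.
    """
    words = sorted(set(words))
    parent = {w: w for w in words}   # union-find forest, one root per cluster

    def find(w):
        # Walk up to the root, shortening the path along the way.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            if distance(w1, w2) <= threshold:
                parent[find(w1)] = find(w2)   # merge the two clusters

    groups = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    return sorted(groups.values())

Target = ["Ajay", "Tree", "Man", "Tiger", "Ajad", "Trend"]
result = disjoint_clusters(Target, threshold=2)
print(result)
# [['Ajad', 'Ajay'], ['Man'], ['Tiger'], ['Tree', 'Trend']]
```

Because membership follows chains of links, a large threshold can merge words that are far apart from each other directly, so the threshold probably needs tuning on the real 1000-word list. The pairwise comparison is still quadratic in the number of words.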