-2

I created a script to calculate the Levenshtein distance of two strings. Now I want to group a list of strings in base of the Levenshtein distance. (if strings have a distance under a threshold, they will be in the same groub):

At the moment, I have done something but doesn't seems to work. Here is a pseudo-code:

for every string in list:
    create a new cluster with this string
    remove the string from the list
    for every string in the remaining list:
        if distance(string1,string2) < threshold:
             add string2 to the cluster
             remove string2 from the list

Here is the real code since a few users asked it:

cid = 0
clusters = {}

numb = range(len(mylist))
for i in numb:
        cls = [mylist[i]]
        numb.remove(i)
        for j in numb:
            if distance(mylist[i],mylist[j]) <= threshold:
                cls.append(mylist[j])
                numb.remove(j)

        clusters[cid] = cls      
        cid+=1
        cls = []
Sfinos
  • 279
  • 4
  • 15

2 Answers2

0

You are removing items from the list you are iterating over. That will change the indexes and cause unexpected behavior you should copy the values you need to a new list rather than modifying the list you are iterating on.

kylieCatt
  • 10,672
  • 5
  • 43
  • 51
0

Here is some cleaned up code:

clusters = defaultdict(list)

numb = range(len(mylist))
for i in numb:
        for j in range(i+1, len(numb)):
            if distance(mylist[i],mylist[j]) <= threshold:
                clusters[i].append(mylist[j])
                clusters[j].append(mylist[i])
grdvnl
  • 636
  • 6
  • 9