Python: Efficient way to find Levenshtein edit distance in a matrix

Question

I would like to measure the similarity between the descriptions in two lists, and afterwards cluster the descriptions. The result should be a matrix like this:

          L2D1    L2D2    L2D3   ...   L2Dn
  L1D1    0       0.3     0.8    ...   0.5
  L1D2    0.2     0.7     0.3    ...   0.2
  L1D3    0       0.3     0.8    ...   0.5
  .       .       .       .            .
  .       .       .       .            .
  .       .       .       .            .
  L1Dn    0.6     0.1     0.9    ...   0.4

import numpy as np
from Levenshtein import distance

List1 = list(new['Description'])
List2 = list(clean['Description'])

# One row per description in List1, one column per description in List2
Matrix = np.zeros((len(List1), len(List2)), dtype=int)

for i in range(0, len(List1)):
    for j in range(0, len(List2)):
        Matrix[i, j] = distance(List1[i], List2[j])

The above method becomes very time consuming as the number and length of the descriptions grow.

So in Method2 I tried to compare the first five words of each pair of descriptions: only if they match do I calculate the distance between the two full strings, otherwise I move on to the next description in the list.

#Method2
for i in range(0,len(List1)):
    K1[i]=str(List1[:1]).split()[0:5]
    for j in range(0,len(List2)):
        K1[i]=str(List2[:1]).split()[0:5]
        if (distance(K1[i],K2[j]))==0:
            Matrix[i,j]=distance(List1[i],List2[j])
        else:
            Matrix[i,j]=1000

But as I am new to this, I am missing some logic and getting:

TypeError: 'int' object does not support item assignment

I also want to implement the same for the next 10 and 100 words. Thanks in advance.

Please provide a working example and test data. The error is probably raised on the K1 variable, but you do not show how you initialize it... I also do not understand how your descriptions are stored in your list, because you sometimes access it with List1[:1] and split the result, and sometimes directly with List1[i]. - T.Lucas, Jan 21 '19 at 10:31
Sorry for skipping that part. Say K1=0. The message descriptions are stored in a list built from a pandas Series, which has two columns: the index and the message description. The same applies to List2, but the two lists contain different sets of message descriptions and have different sizes. - PCH, Jan 21 '19 at 11:08

score 0 · Answer 1 · answered Jan 21 '19 at 09:25

I think you should check the NumPy documentation, in particular the ndarray class.
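The TypeError from the question can be reproduced in isolation. The sketch below assumes, as stated in the comments, that K1 starts out as the plain integer 0; the two descriptions are made-up sample data. An integer cannot be indexed for assignment, whereas a list (or an ndarray) of the right length can.

```python
# Assumption: K1 was initialised as a plain integer, as stated in the comments.
K1 = 0

try:
    K1[0] = ["some", "words"]  # same kind of statement as in Method2
except TypeError as exc:
    error_message = str(exc)   # the message reported in the question

# A container that is indexed by position avoids the error.
descriptions = ["pump failure on line 3", "valve stuck open"]  # made-up data
K1 = [None] * len(descriptions)
for i, text in enumerate(descriptions):
    K1[i] = text.split()[:5]   # first five words of description i
```

The same applies to K2, which Method2 reads but never assigns.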

Here is a slightly more Pythonic way:

for i, new_value in enumerate(List1):
    for j, clean_value in enumerate(List2):
        Matrix[i, j] = distance(new_value, clean_value)
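The idea behind Method2 (only compute the full distance when the first few words agree) could be sketched as follows. This is one reading of the question, not a verified fix: it assumes the prefix should come from each description itself (List1[i] rather than List1[:1]) and that "matching" means the word lists are equal. The names `levenshtein`, `gated_distance_matrix`, `n_words` and `sentinel` are hypothetical, and `levenshtein` is a pure-Python stand-in so the example runs without the Levenshtein package.

```python
def levenshtein(a, b):
    """Pure-Python stand-in for Levenshtein.distance (two-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a                      # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def gated_distance_matrix(list1, list2, n_words=5, sentinel=1000):
    """Full distance only where the first n_words words are equal, else sentinel."""
    # Split each description once, outside the double loop.
    prefixes1 = [text.split()[:n_words] for text in list1]
    prefixes2 = [text.split()[:n_words] for text in list2]

    matrix = [[sentinel] * len(list2) for _ in list1]
    for i, (text1, prefix1) in enumerate(zip(list1, prefixes1)):
        for j, (text2, prefix2) in enumerate(zip(list2, prefixes2)):
            if prefix1 == prefix2:
                matrix[i][j] = levenshtein(text1, text2)
    return matrix
```

Calling `gated_distance_matrix(List1, List2, n_words=10)` or `n_words=100` would cover the longer prefixes, if "next 10 and 100 words" means the first 10 or 100 words. With the Levenshtein package installed, `distance` can be used in place of the stand-in, and the nested list can be passed to `np.array` if an ndarray is needed.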

answered Jan 21 '19 at 09:25

Nikolay Gogol
