I want to calculate the Levenshtein distance between the strings in two columns of a dataframe. The dataframe has approximately 4000 rows; only part of it is shown here.

[Image: excerpt of the dataframe]

For each row, I want the Levenshtein distance between the string in column "Source1" and the string in column "Source2".

Here is my code so far:

#import packages
import pandas as pd
import pyodbc
import Levenshtein as lev
import numpy as np

#Read excel file
df = pd.read_excel(xxx)
df.head(10)

#define arrays
a = df.Source1.to_numpy()
b = df.Source2.to_numpy()

#calculate Levenshtein distance between two arrays
for i,k in zip(a, b):
  print(lev(i, k))

I get the following error:

TypeError                                 Traceback (most recent call last)
Input In [79], in <cell line: 6>()
      5 #calculate Levenshtein distance between two arrays
      6 for i,k in zip(a, b):
      7 # print(type(i), type(k))
----> 8 print(lev(i, k))

TypeError: 'module' object is not callable

Can anyone please advise?

Edit: Jaay helped me in the comments. The name lev refers to the imported module, not a function, so the solution is to call the function inside it: print(lev.distance(i, k))
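For later readers, here is a minimal sketch of that fix applied to two aligned sequences of strings. It assumes the function is Levenshtein.distance, as in the comment from Jaay. The pure-Python fallback, the helper names to_text and pairwise_distances, and the sample data are all hypothetical and only there so the sketch runs on its own. Missing values (None or NaN) are treated as empty strings, which is one possible choice, not necessarily the right one for every dataset.

```python
# Sketch: row-wise Levenshtein distance for two columns of strings.
try:
    # The function lives inside the module, so import it by name.
    from Levenshtein import distance
except ImportError:
    # Fallback (an assumption for this sketch) so it runs without the package.
    def distance(s, t):
        """Classic dynamic-programming edit distance."""
        previous = list(range(len(t) + 1))
        for i, cs in enumerate(s, start=1):
            current = [i]
            for j, ct in enumerate(t, start=1):
                cost = 0 if cs == ct else 1
                current.append(min(previous[j] + 1,          # deletion
                                   current[j - 1] + 1,       # insertion
                                   previous[j - 1] + cost))  # substitution
            previous = current
        return previous[-1]


def to_text(value):
    """Map missing values (None or float NaN) to an empty string."""
    if value is None or value != value:  # NaN is not equal to itself
        return ""
    return str(value)


def pairwise_distances(left, right):
    """Return the edit distance for each aligned pair of values."""
    return [distance(to_text(a), to_text(b)) for a, b in zip(left, right)]


# Hypothetical sample data standing in for df.Source1 and df.Source2.
source1 = ["kitten", "flaw", None, "same"]
source2 = ["sitting", "lawn", "abc", "same"]
distances = pairwise_distances(source1, source2)
print(distances)
```

With the dataframe from the question, the same helper could probably be used as df["distance"] = pairwise_distances(df.Source1, df.Source2), which stores the result as a new column instead of printing each value.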

  • can you post the data in dictionary please – Himanshu Poddar Jul 18 '22 at 10:24
  • have you tried to print(type(i), type(k)) ? – Cadeyrn Jul 18 '22 at 10:31
  • @Cadeyrn when I do that it gives me this: – Pauli du Plooy Jul 18 '22 at 10:33
  • I don't know if this happens because some of the entries in the dataframe are NULL or NaN – Pauli du Plooy Jul 18 '22 at 10:35
  • how can it return 4 classes if there are only two variables? – Cadeyrn Jul 18 '22 at 10:38
  • So I replaced the NaN values with blanks. If I do print(type(i), type(k)) I get the following: which shows everything is the same type. – Pauli du Plooy Jul 18 '22 at 10:39
  • When I do print(lev(i, k)) I get the following error: TypeError Traceback (most recent call last) Input In [79], in () 5 #calculate Levenshtein distance between two arrays 6 for i,k in zip(a, b): 7 # print(type(i), type(k)) ----> 8 print(lev(i, k)) TypeError: 'module' object is not callable – Pauli du Plooy Jul 18 '22 at 10:40
  • @Cadeyrn it prints many more than 4 classes, because there are about 4000 rows in the dataframe and the loop is supposed to calculate the distance between the two strings in every row – Pauli du Plooy Jul 18 '22 at 10:43
  • Hi, in your example 'lev' is the module name, not a function, so it is not callable. The function may be lev.distance(i, k) – Jaay Jul 18 '22 at 10:44
  • now it looks like this : https://stackoverflow.com/questions/4534438/typeerror-module-object-is-not-callable – Cadeyrn Jul 18 '22 at 10:46
  • @Cadeyrn I tried this: import Levenshtein Levenshtein module 'Levenshtein' from 'C:\\Users\\Pauli.Duplooy\\Anaconda3\\lib\\site-packages\\Levenshtein\\__init__.py'> Levenshtein.Levenshtein class 'Levenshtein._Levenshteinobject' from Levenshtein import Levenshtein Levenshtein class 'Levenshtein._Levenshteinobject' and got the following: Input In [14] module 'Levenshtein' from 'C:\\Users\\Pauli.Duplooy\\Anaconda3\\lib\\site-packages\\Levenshtein\\__init__.py'> ^ SyntaxError: invalid syntax – Pauli du Plooy Jul 18 '22 at 10:56
  • have you tried what @Jaay said? – Cadeyrn Jul 18 '22 at 14:06
  • Hi yes I did, and it worked. I did say that in the edited question :) – Pauli du Plooy Jul 18 '22 at 14:08
