0

I have a python program to read two lists (one with errors and other with correct data). Every element in my list with error needs to be compared with every element in my correct list. After comparing i get all the edit distance between every compared pair. now i can find the least edit distance for the given error data and thru get my correct data.

I am trying to use levenshtein distance to calculate the edit distance but its returning all edit distance as 1 even which is wrong.

This means the code for calculating levenshtein distance is not correct. I am struggling to find a fix for this. HELP!

My Code

import csv

def lev(a, b):
    if not a: return len(b)
    if not b: return len(a)
    return min(lev(a[1:], b[1:])+(a[0] != b[0]), lev(a[1:], b)+1, lev(a, b[1:])+1)

if __name__ == "__main__":

    with open("all_correct_promo.csv","rb") as file1:
        reader1 = csv.reader(file1)
        correctPromoList = list(reader1)
        #print correctPromoList

    with open("all_extracted_promo.csv","rb") as file2:
        reader2 = csv.reader(file2)
        extractedPromoList = list(reader2)
        #print extractedPromoList

    incorrectPromo = []
    count = 0
    for extracted in extractedPromoList:
        if(extracted not in correctPromoList):
            incorrectPromo.append(extracted)
        else:
            count = count + 1
    #print incorrectPromo

    for promos in incorrectPromo:
        for correctPromo in correctPromoList:
            distance = lev(promos,correctPromo)
            print promos, correctPromo , distance
safwan
  • 87
  • 1
  • 1
  • 10
  • As I posted in my answer, your implmentation seems correct (althought I recommend you a better one). If you need to fix this one anyway, please provide the case where your algorithm is incorrectly returning 1 (which I could not reproduce myself) – caspillaga Nov 24 '16 at 13:23

2 Answers2

1

The implementation is correct. I tested this:

def lev(a, b):
    if not a: return len(b)
    if not b: return len(a)
    return min(lev(a[1:], b[1:])+(a[0] != b[0]), lev(a[1:], b)+1, lev(a, b[1:])+1)


print lev('abcde','bc') # prints 3, which is correct
print lev('abc','bc') # prints 1, which is correct

Your problem, as I noticed by your comments, is probably when you call the method:

a = ['NSP-212690']
b = ['FE SV X']

print lev(a,b) # prints 1 which is incorrect because you are comparing arrays, not strings
print lev(a[0],b[0]) # prints 10, which is correct

So, what you can do is:

Before the call to "lev(a,b)", extract the first element of each array

def lev(a, b):
    if not a: return len(b)
    if not b: return len(a)
    return min(lev(a[1:], b[1:])+(a[0] != b[0]), lev(a[1:], b)+1, lev(a, b[1:])+1)

a = ['NSP-212690']
b = ['FE SV X']
a = a[0] # this is the key part
b = b[0] # and this
print lev(a,b) # prints 10, which is correct

Anyway, I would not recommend you that recursive implementation, because the performance is veeeery poor

I would recommend this implementation instead (source: wikipedia-levenshtein)

def lev(seq1, seq2):
    oneago = None
    thisrow = range(1, len(seq2) + 1) + [0]
    for x in xrange(len(seq1)):
        twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
        for y in xrange(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
    return thisrow[len(seq2) - 1]

or maybe this slightly modified version:

def lev(seq1, seq2):
    if not a: return len(b)
    if not b: return len(a)
    oneago = None
    thisrow = range(1, len(seq2) + 1) + [0]
    for x in xrange(len(seq1)):
        twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
        for y in xrange(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
    return thisrow[len(seq2) - 1]
caspillaga
  • 573
  • 4
  • 16
  • I tried both the codes and still its giving edit distance 1 for all compared promo codes. ['NSP-212690'] ['FE SV X'] 1 ['NSP-212690'] ['FE SV2 A'] 1 ['NSP-212690'] ['FE SV2 B'] 1 ['NSP-212690'] ['FE SV2 C'] 1 ['NSP-212690'] ['FE SV2 D'] 1 ['NSP-212690'] ['FE SV2 E'] 1 ['NSP-212690'] ['FLATRATE DS1'] 1 ['NSP-212690'] ['FLATRATE DS1c'] 1 ['NSP-212690'] ['FLATRATE DS3'] 1 ['NSP-212690'] ['FreeMM'] 1 ['NSP-212690'] ['FreeWeb'] 1 ['NSP-212690'] ['FTTCIP10'] 1 – safwan Nov 25 '16 at 04:36
  • Ah! now I see your problem! I will update my answer in a few seconds... please read it! – caspillaga Nov 25 '16 at 13:15
  • I just updated my answer. I think that your problem was on the method call, not in the method itself. You are comparing arrays instead of strings, so you should extract the first (and only) element of each array before the method call. a = ['NSP-212690'] ---> a[0] = 'NSP-212690' – caspillaga Nov 25 '16 at 13:44
1

There is a Python package available that implements the levenshtein distance : python-levenshtein

To install it:

pip install python-levenshtein

To use it:

>>> import Levenshtein
>>> string1 = 'dsfjksdjs'
>>> string2 = 'dsfiksjsd'
>>> print Levenshtein.distance(string1, string2)
3
Sebastian Wozny
  • 16,943
  • 7
  • 52
  • 69