Here's the textbook example of the general algorithm for calculating Levenshtein distance (taken from Magnus Hetland's website):

def levenshtein(a,b):
    "Calculates the Levenshtein distance between a and b."
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space
        a,b = b,a
        n,m = m,n

    current = range(n+1)
    for i in range(1,m+1):
        previous, current = current, [i]+[0]*n
        for j in range(1,n+1):
            add, delete = previous[j]+1, current[j-1]+1
            change = previous[j-1]
            if a[j-1] != b[i-1]:
                change = change + 1
            current[j] = min(add, delete, change)

    return current[n]

I was wondering, however, whether there might be a more efficient (and potentially more elegant) pure Python implementation that uses difflib's SequenceMatcher. After playing around with it, here's what I came up with:

from difflib import SequenceMatcher as sm

def lev_using_difflib(s1, s2):
    a = b = size = distance = 0
    for m in sm(a=s1, b=s2).get_matching_blocks():
        distance += max(m.a-a, m.b-b) - size
        a, b, size = m
    return distance

I can't come up with a test case where it fails, and the performance seems to be significantly better than the standard algorithm.

Here are the timing results for the Levenshtein implementation that relies on difflib:

>>> from timeit import Timer
>>> setup = """
... from difflib import SequenceMatcher as sm
... 
... def lev_using_difflib(s1, s2):
...     a = b = size = distance = 0
...     for m in sm(a=s1, b=s2).get_matching_blocks():
...         distance += max(m.a-a, m.b-b) - size
...         a, b, size = m
...     return distance
... 
... strings = [('sunday','saturday'),
...            ('fitting','babysitting'),
...            ('rosettacode','raisethysword')]
... """
>>> stmt = """
... for s in strings:
...     lev_using_difflib(*s)
... """
>>> Timer(stmt, setup).timeit(100000)
36.989389181137085

And here are the results for the standard pure Python implementation:

>>> from timeit import Timer
>>> setup2 = """
... def levenshtein(a,b):
...     n, m = len(a), len(b)
...     if n > m:
...         a,b = b,a
...         n,m = m,n
... 
...     current = range(n+1)
...     for i in range(1,m+1):
...         previous, current = current, [i]+[0]*n
...         for j in range(1,n+1):
...             add, delete = previous[j]+1, current[j-1]+1
...             change = previous[j-1]
...             if a[j-1] != b[i-1]:
...                 change = change + 1
...             current[j] = min(add, delete, change)
... 
...     return current[n]
... 
... strings = [('sunday','saturday'),
...            ('fitting','babysitting'),
...            ('rosettacode','raisethysword')]
... """
>>> stmt2 = """
... for s in strings:
...     levenshtein(*s)
... """
>>> Timer(stmt2, setup2).timeit(100000)
55.594768047332764

Is the performance of the algorithm using difflib's SequenceMatcher really better? Or is it relying on a C library that invalidates the comparison completely? If it is relying on C extensions, how can I tell by looking at the difflib.py implementation?
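One rough way to check this from the interpreter rather than by reading the source, sketched here under the assumption of a standard CPython install: ask `inspect` for the module's source file and check whether a given method is a builtin. The helper name `looks_pure_python` is made up for this illustration, and the check covers only difflib itself, not the modules it imports.

```python
import difflib
import inspect


def looks_pure_python(module):
    """Heuristic (hypothetical helper): True if the module has a .py source file."""
    try:
        source = inspect.getsourcefile(module)
    except TypeError:
        # Built-in modules compiled into the interpreter have no file at all.
        return False
    # Compiled extension modules (.so / .pyd) yield None here.
    return source is not None and source.endswith('.py')


print(looks_pure_python(difflib))
# C-implemented callables report True for isbuiltin; Python-level methods do not.
print(inspect.isbuiltin(difflib.SequenceMatcher.get_matching_blocks))
print(inspect.isbuiltin(len))
```

A result that points at a `.py` file only says that the module's own code is Python; anything it imports would need the same check.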

Using Python 2.7.3 [GCC 4.2.1 (Apple Inc. build 5666)]

Thanks in advance for your help!

damzam
  • The source for `SequenceMatcher` isn't too long. Just skim it. – Blender Sep 30 '12 at 07:49
  • @Blender I did... the only things that appeared to be implemented in C were the deque and defaultdict from the collections module. But it didn't look like either of those was being used by SequenceMatcher. That being said, I'm a little out of my element trying to understand how C extensions are used. – damzam Sep 30 '12 at 08:27
  • It seems (from the SequenceMatcher documentation) that the algorithm SequenceMatcher uses is not guaranteed to generate a minimal number of edits, but a more "intuitive" set of edits. Levenshtein leans the opposite way. Have you tried generating many pairs of long, random strings and feeding those as input to your two routines? That might be a better testing strategy. – Sam Mussmann Sep 30 '12 at 21:05
  • @SamMussmann My testing strategy was clearly inadequate. There are cases where the results are incorrect. – damzam Oct 02 '12 at 21:20
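
The random-pair testing suggested in the comments could look roughly like the sketch below. The two distance functions are the ones from the question (lightly tidied so they also run on Python 3); `find_mismatches` and its parameters (`trials`, `max_len`, `alphabet`, `seed`) are hypothetical names chosen for this illustration. A small alphabet is assumed because it makes repeated characters, and therefore ambiguous matches, more likely.

```python
import random
from difflib import SequenceMatcher


def levenshtein(a, b):
    """Reference dynamic-programming Levenshtein distance (as in the question)."""
    n, m = len(a), len(b)
    if n > m:
        a, b = b, a
        n, m = m, n
    current = list(range(n + 1))
    for i in range(1, m + 1):
        previous, current = current, [i] + [0] * n
        for j in range(1, n + 1):
            add, delete = previous[j] + 1, current[j - 1] + 1
            change = previous[j - 1] + (a[j - 1] != b[i - 1])
            current[j] = min(add, delete, change)
    return current[n]


def lev_using_difflib(s1, s2):
    """difflib-based candidate from the question."""
    a = b = size = distance = 0
    for m in SequenceMatcher(a=s1, b=s2).get_matching_blocks():
        distance += max(m.a - a, m.b - b) - size
        a, b, size = m
    return distance


def find_mismatches(trials=1000, max_len=8, alphabet='ab', seed=0):
    """Return the random pairs on which the two functions disagree."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(trials):
        s1 = ''.join(rng.choice(alphabet) for _ in range(rng.randint(0, max_len)))
        s2 = ''.join(rng.choice(alphabet) for _ in range(rng.randint(0, max_len)))
        if levenshtein(s1, s2) != lev_using_difflib(s1, s2):
            mismatches.append((s1, s2))
    return mismatches


# Show a handful of disagreeing pairs, if any were found.
for s1, s2 in find_mismatches()[:5]:
    print(repr(s1), repr(s2), levenshtein(s1, s2), lev_using_difflib(s1, s2))
```

Any pair printed here is a counterexample to the difflib version; an empty result would only mean that this particular sample found none.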

1 Answer

>>> levenshtein('hello', 'world')
4
>>> lev_using_difflib('hello', 'world')
5
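
To see where the extra edit comes from, it may help to trace the matching blocks that `SequenceMatcher` reports for this pair. The sketch below reuses the accumulation formula from the question; the `step` variable is added only for printing. The likely explanation, offered as a reading of the trace rather than a guarantee about difflib's behaviour in general, is that the matcher anchors on an 'l' at different positions in the two strings, while the 4-edit solution keeps the 'l' at index 3 of both strings and substitutes the other four letters.

```python
from difflib import SequenceMatcher

blocks = SequenceMatcher(a='hello', b='world').get_matching_blocks()

a = b = size = distance = 0
for m in blocks:
    # Cost charged for the unmatched gap between the previous block and this one.
    step = max(m.a - a, m.b - b) - size
    distance += step
    print(tuple(m), step, distance)
    a, b, size = m
```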
Gareth Rees
  • Thanks Gareth. I should have tested for correctness more thoroughly before posting and running performance tests. – damzam Oct 02 '12 at 21:22