Levenshtein-distance algorithm

Question

def worddistance(source, target):
    ''' Return the Levenshtein distance between 2 strings '''

    if len(source) > len(target):
        source, target = target, source
    # Now target becomes the larger string, if it is 0, surely len(source) is 0?
    if len(target) == 0:
        return len(source)

    ### Continue on to calculate distance.

Isn't it the same as saying if both the parameters are the same, return 0?

I am not exactly sure what this part of the function is trying to achieve.

@MartijnPieters I am not sure if it does anything else. If not, I would just simplify it. – Gavin, Oct 31 '14 at 08:00

Martijn Pieters · Answer 1 · 2014-10-31T08:57:46.097

1

Yes, the code can only return 0 there, because after the swap that test is true only when both strings are empty. You can see almost the same style in the Wikibooks implementation, but the coder here simply hasn't thought the code through.

You can simply change that second test to:

if not target:
    return 0

and not change the meaning.

The Wikibooks implementation, however, tests source instead:

if not source:
    return len(target)

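which makes much more sense: after the swap, source is the shorter string, so when it is empty the distance is simply the len(target) insertions needed to build target.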

The function does more work after that line; the test is merely a boundary check. With the check gone, the algorithm would still work, just less efficiently: for an empty source, the Wikibooks version would produce a series of 1-element lists ranging from [1] through to [len(target)], then return the last element of the final list, which is len(target).
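To make the role of the boundary check concrete, here is a minimal sketch of a complete function in the row-by-row style described above. It is not the verbatim Wikibooks code; the row-building loop and the local names (previous_row, current_row, and the three cost variables) are assumptions chosen for illustration.

```python
def worddistance(source, target):
    """Return the Levenshtein distance between two strings."""
    # Make source the shorter string so each row is as short as possible.
    if len(source) > len(target):
        source, target = target, source

    # Boundary check: with an empty source the answer is len(target).
    # Removing these two lines does not change any result; the loop below
    # would just build the rows [1], [2], ..., [len(target)] first.
    if not source:
        return len(target)

    # previous_row[j] holds the distance between the part of target
    # processed so far and source[:j].
    previous_row = list(range(len(source) + 1))
    for i, t_char in enumerate(target):
        current_row = [i + 1]
        for j, s_char in enumerate(source):
            skip_target_char = previous_row[j + 1] + 1
            skip_source_char = current_row[j] + 1
            substitute = previous_row[j] + (s_char != t_char)
            current_row.append(min(skip_target_char, skip_source_char, substitute))
        previous_row = current_row

    return previous_row[-1]
```

Because the swap guarantees len(source) <= len(target), testing source covers every case where either string is empty, which is why this form of the check is the useful one.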

edited Oct 31 '14 at 08:57

answered Oct 31 '14 at 08:01


`if not target: return len(source)`. Returning `0` if `len(target)` is `0` is wrong. – Matthias Oct 31 '14 at 08:03

@Matthias: no, because the line before *already determined that source is the same length or shorter* – Martijn Pieters Oct 31 '14 at 08:03
Oh, I mixed up `<` and `>` in my head. – Matthias Oct 31 '14 at 08:04
To add on, wouldn't it be neater to test whether both strings are actually equal and return 0 if so, instead of testing whether both have length 0? – Gavin Oct 31 '14 at 08:05
@georg: Right. That case has to be covered later in the code. – Matthias Oct 31 '14 at 08:06
@MaTaKazer: but then you are looping over the full strings anyway. Sure, it is masked by Python C code, but from an algorithm POV that's doing the work twice. – Martijn Pieters Oct 31 '14 at 08:08
@MaTaKazer Even better: find the longest common prefix of the two strings, then the longest common suffix. If the strings are equal, the algorithm finishes immediately after. If they're not equal but share common prefixes or suffixes, the quadratic-time part gets smaller inputs. – Fred Foo Oct 31 '14 at 09:30
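
The prefix/suffix idea from the last comment could be sketched roughly as follows. The helper name strip_common_affixes is hypothetical and does not come from the thread; it only trims the inputs, and the trimmed pair would then be passed to whatever distance function is in use.

```python
def strip_common_affixes(source, target):
    """Drop the longest common prefix, then the longest common suffix."""
    # Walk forward while the leading characters agree.
    start = 0
    limit = min(len(source), len(target))
    while start < limit and source[start] == target[start]:
        start += 1

    # Walk backward while the trailing characters agree, without
    # crossing into the prefix that was already matched.
    end_s, end_t = len(source), len(target)
    while end_s > start and end_t > start and source[end_s - 1] == target[end_t - 1]:
        end_s -= 1
        end_t -= 1

    return source[start:end_s], target[start:end_t]
```

For equal strings both results are empty, so an empty-string check in the distance function returns 0 immediately; otherwise the quadratic part only sees the differing middle sections.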
