
I've implemented the Levenshtein distance to do signal alignment. There are cases where the solution Levenshtein finds is optimal in terms of edit distance, but it is not the one I want. For example, I have the strings:

  aaabaa
abaaabaaa

The algorithm should recognize that it needs to delete the first two characters and the last one to match the strings:

abaaabaaa
xx      x

Instead it finds:

abaaabaaa
 x  x   x

Thus it splits the string into more substrings than necessary. Is there an extension of the Levenshtein distance that splits the string into the fewest substrings?
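
For reference, a plain dynamic-programming Levenshtein distance (a minimal Python sketch, for illustration only) reports the optimal distance of 3 here and has no reason to prefer either alignment, since both need exactly three deletions:

    def levenshtein(a, b):
        # Classic dynamic-programming Levenshtein distance with unit costs.
        n, m = len(a), len(b)
        prev = list(range(m + 1))
        for i in range(1, n + 1):
            curr = [i] + [0] * m
            for j in range(1, m + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                curr[j] = min(prev[j] + 1,         # delete a[i-1]
                              curr[j - 1] + 1,     # insert b[j-1]
                              prev[j - 1] + cost)  # match / substitute
            prev = curr
        return prev[m]

    print(levenshtein("aaabaa", "abaaabaaa"))  # 3 -- both alignments above cost 3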

Henste93
Have you seen this question? http://stackoverflow.com/questions/10425238/modifying-levenshtein-distance-for-positional-bias?rq=1 Two alternative algorithms are mentioned there that might help you. – Jens Schauder Dec 21 '15 at 06:45

1 Answer


You can introduce a more complicated edit cost function than the one the Levenshtein distance uses: make n consecutive deletions (or n consecutive insertions) cheaper than n separate deletions (or insertions).

This would make the solution you want cheaper than the one the Levenshtein distance found.

Example of an edit cost function that should meet your needs:

cost of replace: 2
cost of first insert: 2
cost of each consecutive insert: 1
cost of first delete: 2
cost of each consecutive delete: 1

Then

abaaabaaa
xx      x

would have an edit cost of 5 (2 + 1 for the run of two deletes, plus 2 for the single delete)

and

abaaabaaa
 x  x   x

would have an edit cost of 6 (three separate deletes, 2 each).

So the solution found would be the one you desire, with an edit cost of 5.
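
A minimal sketch of this cost scheme (in Python, with the costs above kept as parameters; this is essentially the standard affine-gap, Gotoh-style dynamic program, not code from the question) could look like this:

    def affine_edit_distance(a, b, sub_cost=2, gap_open=2, gap_extend=1):
        # Edit distance where a run of k deletions (or insertions) costs
        # gap_open + (k - 1) * gap_extend instead of k * gap_open.
        n, m = len(a), len(b)
        INF = float("inf")

        # M[i][j]: best cost of aligning a[:i], b[:j] ending in a match/replace
        # D[i][j]: ... ending in a delete of a[i-1]
        # I[i][j]: ... ending in an insert of b[j-1]
        M = [[INF] * (m + 1) for _ in range(n + 1)]
        D = [[INF] * (m + 1) for _ in range(n + 1)]
        I = [[INF] * (m + 1) for _ in range(n + 1)]

        M[0][0] = 0
        for i in range(1, n + 1):
            D[i][0] = gap_open + (i - 1) * gap_extend
        for j in range(1, m + 1):
            I[0][j] = gap_open + (j - 1) * gap_extend

        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if a[i - 1] == b[j - 1] else sub_cost
                M[i][j] = cost + min(M[i-1][j-1], D[i-1][j-1], I[i-1][j-1])
                D[i][j] = min(M[i-1][j] + gap_open,    # start a new run of deletes
                              D[i-1][j] + gap_extend,  # extend the current run
                              I[i-1][j] + gap_open)
                I[i][j] = min(M[i][j-1] + gap_open,    # start a new run of inserts
                              I[i][j-1] + gap_extend,  # extend the current run
                              D[i][j-1] + gap_open)

        return min(M[n][m], D[n][m], I[n][m])

    print(affine_edit_distance("aaabaa", "abaaabaaa"))  # should print 5 for the example above

Because a run of edits is cheaper than the same number of scattered edits, the cheapest alignment under this scheme is the one that splits the string into the fewest substrings. The same idea appears as affine gap penalties in biological sequence alignment.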

MrSmith42