
I've implemented the Levenshtein distance to do signal alignment. There are cases where the solution Levenshtein finds is optimal in terms of edit distance, but it is not the one I want. For example, I have the strings:

  aaabaa
abaaabaaa

The algorithm should recognize that it needs to delete the first two characters and the last one to match the strings:

abaaabaaa
xx      x

Instead it finds:

abaaabaaa
 x  x   x

Thus it splits the string into more substrings than necessary. Is there an extension of the Levenshtein distance that splits the string into the fewest substrings?
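
For reference, a plain dynamic-programming Levenshtein distance (a minimal Python sketch, for illustration only) reports the optimal distance of 3 here and has no reason to prefer either alignment, since both need exactly three deletions:

    def levenshtein(a, b):
        # Classic dynamic-programming Levenshtein distance with unit costs.
        n, m = len(a), len(b)
        prev = list(range(m + 1))
        for i in range(1, n + 1):
            curr = [i] + [0] * m
            for j in range(1, m + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                curr[j] = min(prev[j] + 1,         # delete a[i-1]
                              curr[j - 1] + 1,     # insert b[j-1]
                              prev[j - 1] + cost)  # match / substitute
            prev = curr
        return prev[m]

    print(levenshtein("aaabaa", "abaaabaaa"))  # 3 -- both alignments above cost 3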

Henste93
Have you seen this question? http://stackoverflow.com/questions/10425238/modifying-levenshtein-distance-for-positional-bias?rq=1 Two alternative algorithms are mentioned there that might help you. – Jens Schauder Dec 21 '15 at 06:45

1 Answer


You can introduce a more complicated edit cost function than the one the Levenshtein distance uses: make n consecutive deletions (or n consecutive insertions) cheaper than n separate deletions (or insertions).

This would make the solution you want cheaper than the one the Levenshtein distance found.

Example of an edit cost function that should meet your needs:

cost of replace: 2
cost of first insert: 2
cost of each consecutive insert: 1
cost of first delete: 2
cost of each consecutive delete: 1

Then

abaaabaaa
xx      x

would have an edit cost of 5 (2 + 1 for the run of two deletes, plus 2 for the single delete)

and

abaaabaaa
 x  x   x

would have an edit cost of 6 (three separate deletes, 2 each).

So the solution found would be the one you desire, with an edit cost of 5.
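
A minimal sketch of this cost scheme (in Python, with the costs above kept as parameters; this is essentially the standard affine-gap, Gotoh-style dynamic program, not code from the question) could look like this:

    def affine_edit_distance(a, b, sub_cost=2, gap_open=2, gap_extend=1):
        # Edit distance where a run of k deletions (or insertions) costs
        # gap_open + (k - 1) * gap_extend instead of k * gap_open.
        n, m = len(a), len(b)
        INF = float("inf")

        # M[i][j]: best cost of aligning a[:i], b[:j] ending in a match/replace
        # D[i][j]: ... ending in a delete of a[i-1]
        # I[i][j]: ... ending in an insert of b[j-1]
        M = [[INF] * (m + 1) for _ in range(n + 1)]
        D = [[INF] * (m + 1) for _ in range(n + 1)]
        I = [[INF] * (m + 1) for _ in range(n + 1)]

        M[0][0] = 0
        for i in range(1, n + 1):
            D[i][0] = gap_open + (i - 1) * gap_extend
        for j in range(1, m + 1):
            I[0][j] = gap_open + (j - 1) * gap_extend

        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if a[i - 1] == b[j - 1] else sub_cost
                M[i][j] = cost + min(M[i-1][j-1], D[i-1][j-1], I[i-1][j-1])
                D[i][j] = min(M[i-1][j] + gap_open,    # start a new run of deletes
                              D[i-1][j] + gap_extend,  # extend the current run
                              I[i-1][j] + gap_open)
                I[i][j] = min(M[i][j-1] + gap_open,    # start a new run of inserts
                              I[i][j-1] + gap_extend,  # extend the current run
                              D[i][j-1] + gap_open)

        return min(M[n][m], D[n][m], I[n][m])

    print(affine_edit_distance("aaabaa", "abaaabaaa"))  # should print 5 for the example above

Because a run of edits is cheaper than the same number of scattered edits, the cheapest alignment under this scheme is the one that splits the string into the fewest substrings. The same idea appears as affine gap penalties in biological sequence alignment.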

MrSmith42