I am trying to use Jellyfish to work with fuzzy strings. I am noticing some strange behaviour of the Damerau–Levenshtein distance algorithm. For example:
import jellyfish as jf
In [0]: jf.damerau_levenshtein_distance('ZX', 'XYZ')
Out[0]: 3
In [1]: jf.damerau_levenshtein_distance('BADC', 'ABCD')
Out[1]: 3
To my mind both should score 2.
In the first example:
ZX → XZ (transpose adjacent characters)
XZ → XYZ (insert Y)
In the second example:
BADC → ABDC (transpose adjacent characters BA)
ABDC → ABCD (transpose adjacent characters DC)
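To show that these two-step sequences really do turn each input into its target, here is a small sketch that applies the edits by hand. The helpers `transpose` and `insert` are hypothetical names written for this check; they are not part of Jellyfish.

```python
def transpose(s: str, i: int) -> str:
    """Swap the adjacent characters at positions i and i + 1."""
    return s[:i] + s[i + 1] + s[i] + s[i + 2:]


def insert(s: str, i: int, ch: str) -> str:
    """Insert ch so that it ends up at position i."""
    return s[:i] + ch + s[i:]


# First example: ZX -> XZ -> XYZ (one transposition, one insertion)
step = transpose("ZX", 0)
assert step == "XZ"
assert insert(step, 1, "Y") == "XYZ"

# Second example: BADC -> ABDC -> ABCD (two transpositions)
step = transpose("BADC", 0)
assert step == "ABDC"
assert transpose(step, 2) == "ABCD"
```

This only confirms that a two-edit path exists for each pair; it does not say which definition of the distance Jellyfish uses.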
Is this something wrong with the algorithm, or have I misunderstood the measure? Any guidance would be appreciated.
EDIT
Just to make things more peculiar, I also observe the following:
In [3]: jf.damerau_levenshtein_distance('jellyifhs', 'jellyfish')
Out[3]: 2
In [4]: jf.damerau_levenshtein_distance('ifhs', 'fish')
Out[4]: 3
This is particularly odd: not only should both examples need two edits, they are exactly the same two edits.
In the third example:
jellyifhs → jellyfihs (transpose adjacent characters if)
jellyfihs → jellyfish (transpose adjacent characters hs)
In the fourth example:
ifhs → fihs (transpose adjacent characters if)
fihs → fish (transpose adjacent characters hs)
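For cross-checking, here is a minimal pure-Python sketch of the unrestricted Damerau–Levenshtein distance, using the usual dynamic-programming formulation with a table of last-seen character positions. I am assuming this is the variant the library means to compute, which may not be the case. The name `dl_distance` is my own and is not a Jellyfish function.

```python
def dl_distance(a: str, b: str) -> int:
    """Unrestricted Damerau-Levenshtein distance between a and b.

    Allowed edits: insertion, deletion, substitution, and transposition
    of two adjacent characters. This is a reference sketch, not the
    Jellyfish implementation.
    """
    inf = len(a) + len(b)  # larger than any real distance
    # d is shifted by one in both directions so that row 0 and column 0
    # form a border filled with `inf`.
    d = [[inf] * (len(b) + 2) for _ in range(len(a) + 2)]
    for i in range(len(a) + 1):
        d[i + 1][1] = i  # delete the first i characters of a
    for j in range(len(b) + 1):
        d[1][j + 1] = j  # insert the first j characters of b

    last_row = {}  # character -> last 1-based position in a where it was seen
    for i in range(1, len(a) + 1):
        last_col = 0  # last 1-based position in b that matched a[i - 1]
        for j in range(1, len(b) + 1):
            k = last_row.get(b[j - 1], 0)
            col = last_col
            cost = 0 if a[i - 1] == b[j - 1] else 1
            if cost == 0:
                last_col = j
            d[i + 1][j + 1] = min(
                d[i][j] + cost,      # match or substitution
                d[i + 1][j] + 1,     # insertion
                d[i][j + 1] + 1,     # deletion
                # transposition, plus any edits between the swapped pair
                d[k][col] + (i - k - 1) + 1 + (j - col - 1),
            )
        last_row[a[i - 1]] = i
    return d[len(a) + 1][len(b) + 1]


pairs = [("ZX", "XYZ"), ("BADC", "ABCD"),
         ("jellyifhs", "jellyfish"), ("ifhs", "fish")]
for left, right in pairs:
    print(left, right, dl_distance(left, right))
```

Under this sketch all four pairs score 2, which matches the hand-worked edit sequences above.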