I am trying to use Jellyfish to work with fuzzy strings. I am noticing some strange behaviour of the Damerau–Levenshtein distance algorithm. For example:

import jellyfish as jf
In [0]: jf.damerau_levenshtein_distance('ZX', 'XYZ')
Out[0]: 3
In [1]: jf.damerau_levenshtein_distance('BADC', 'ABCD')
Out[1]: 3

To my mind both should score 2.

In the first example:

  1. ZX → XZ (transpose adjacent characters)
  2. XZ → XYZ (insert Y)

In the second example:

  1. BADC → ABDC (transpose adjacent characters BA)
  2. ABDC → ABCD (transpose adjacent characters DC)

Is this something wrong with the algorithm, or have I misunderstood the measure? Any guidance would be appreciated.

EDIT

Just to make things more peculiar, I also observe the following:

In [3]: jf.damerau_levenshtein_distance('jellyifhs', 'jellyfish')
Out[3]: 2
In [4]: jf.damerau_levenshtein_distance('ifhs', 'fish')
Out[4]: 3

This is particularly odd: not only should the number of edits be two in both examples, but they are exactly the same edits:

In the third example:

  1. jellyifhs → jellyfihs (transpose adjacent characters if)
  2. jellyfihs → jellyfish (transpose adjacent characters hs)

In the fourth example:

  1. ifhs → fihs (transpose adjacent characters if)
  2. fihs → fish (transpose adjacent characters hs)
Gareth Rees
Woody Pride
  • I think transposing counts as two steps. – aIKid Nov 28 '13 at 06:15
  • @aIKid: Transposition of two adjacent characters is a single operation/step. – 0xc0de Nov 28 '13 at 06:19
  • +1, looks like they have implemented OSA instead of Damerau–Levenshtein distance. – 0xc0de Nov 28 '13 at 07:40
  • That's what I thought... Actually, I'm also finding errors with jaro_distance. I am on GitHub but never use it; is it better to email them directly or to raise something on GitHub? I think I will update my question here to also raise that issue. – Woody Pride Nov 28 '13 at 08:00

1 Answer

From Wikipedia:

In information theory and computer science, the Damerau–Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein)[citation needed] is a "distance" (string metric) between two strings, i.e., finite sequence of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two adjacent characters.

But if you read further,

Take for example the edit distance between CA and ABC. The Damerau–Levenshtein distance LD(CA,ABC) = 2 because CA -> AC -> ABC, but the optimal string alignment distance OSA(CA,ABC) = 3 because if the operation CA -> AC is used, it is not possible to use AC -> ABC because that would require the substring to be edited more than once, which is not allowed in OSA, and therefore the shortest sequence of operations is CA -> A -> AB -> ABC. Note that for the optimal string alignment distance, the triangle inequality does not hold: OSA(CA,AC) + OSA(AC,ABC) < OSA(CA,ABC), and so it is not a true metric.
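To make the restriction in that passage concrete, here is a minimal sketch of optimal string alignment in plain Python. The function name `osa_distance` is my own for illustration; it is not part of jellyfish and is not claimed to match its source. The transposition step only ever looks two cells back diagonally, which is why a transposed pair can never be edited again.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment (restricted edit) distance. Illustrative sketch."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the prefixes a[:i] and b[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters; nothing may be
            # inserted between them afterwards, hence the fixed i-2, j-2.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("CA", "ABC"))  # 3
print(osa_distance("ZX", "XYZ"))  # 3
```

With this sketch, OSA(CA,AC) and OSA(AC,ABC) are both 1 while OSA(CA,ABC) is 3, which is the triangle inequality violation the passage describes.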

EDIT:

After taking a look at the source, it's clear that the function calculates OSA instead of Damerau–Levenshtein distance.
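For reference, a sketch of the unrestricted Damerau–Levenshtein distance in plain Python, following the usual dynamic-programming formulation that remembers where each character was last seen. The name `damerau_levenshtein` and the variable names are hypothetical; this is not jellyfish's code, just an illustration of what the question expected.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Unrestricted Damerau-Levenshtein distance. Illustrative sketch."""
    max_dist = len(a) + len(b)
    last_row = {}  # last row (1-based) in which each character of a was seen
    # Table with an extra border row and column filled with max_dist;
    # d[i + 1][j + 1] is the distance between a[:i] and b[:j].
    d = [[max_dist] * (len(b) + 2) for _ in range(len(a) + 2)]
    for i in range(len(a) + 1):
        d[i + 1][1] = i
    for j in range(len(b) + 1):
        d[1][j + 1] = j
    for i in range(1, len(a) + 1):
        last_col = 0  # last column in this row where the characters matched
        for j in range(1, len(b) + 1):
            k = last_row.get(b[j - 1], 0)
            m = last_col
            if a[i - 1] == b[j - 1]:
                cost = 0
                last_col = j
            else:
                cost = 1
            d[i + 1][j + 1] = min(
                d[i][j] + cost,   # substitution or match
                d[i + 1][j] + 1,  # insertion
                d[i][j + 1] + 1,  # deletion
                # transposition, with edits allowed between the swapped pair
                d[k][m] + (i - k - 1) + 1 + (j - m - 1),
            )
        last_row[a[i - 1]] = i
    return d[len(a) + 1][len(b) + 1]


print(damerau_levenshtein("CA", "ABC"))               # 2
print(damerau_levenshtein("ZX", "XYZ"))               # 2
print(damerau_levenshtein("BADC", "ABCD"))            # 2
print(damerau_levenshtein("jellyifhs", "jellyfish"))  # 2
```

Under this definition all of the examples in the question score 2, matching the edit sequences listed there.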

0xc0de
  • This seems to pretty clearly state that "The Damerau–Levenshtein distance LD(CA,ABC) = 2", which is why I am surprised when the same problem implemented in jellyfish returns 3. – Woody Pride Nov 28 '13 at 06:28
  • Yes, my bad; it looks like the implementation computes OSA. Filed an issue: https://github.com/sunlightlabs/jellyfish/issues/13. – 0xc0de Nov 28 '13 at 07:46
  • That's what I thought... Actually, I'm also finding errors with jaro_distance. I am on GitHub but never use it; is it better to email them directly or to raise something on GitHub? – Woody Pride Nov 28 '13 at 07:57
  • Raising issues on GitHub is always better: other users will also come to know about the issue, and some of them (who might not be the repository owners) can submit fixes for it. – 0xc0de Nov 28 '13 at 08:00
  • Cool, I think I'll raise another here. The thing is, I am not sure if I am getting it wrong or the code is wrong. I'm not an expert on these measures... – Woody Pride Nov 28 '13 at 08:46
  • @WoodyPride: I have gone through the code, that's why I filed the bug :). – 0xc0de Nov 28 '13 at 13:39
  • I actually meant for the updated problem. The jaro_distance does not seem to be calculating properly either. I put an example above. I have not looked at the source code yet, but I did add some description to your issue log. – Woody Pride Nov 28 '13 at 13:46