
I am trying to use Jellyfish to work with fuzzy strings. I am noticing some strange behaviour of the jaro_distance algorithm.

I previously had an issue with the damerau_levenshtein_distance algorithm that appeared to be a bug in the code; another Stack Overflow user then raised it as an issue on GitHub.

I am not sure if I am thinking about the measure wrong, or if it is a genuine bug. I have looked at the source code (http://goo.gl/YVMl8k), but I'm not familiar with C, so it's hard for me to know whether this is an implementation problem or I am just wrong.

Observe the following:

In [1]: import jellyfish as jf
In [2]: S1 = 'Poverty'
In [3]: S2 = 'Poervty'
In [4]: jf.jaro_distance(S1, S2)
Out[4]: 0.95238095

Now, if my understanding of the Jaro distance measure is correct, I believe the result should be 0.9285714285.

I have identified where the calculation is going wrong. To calculate the measure, I believe the following is correct:

(7.0/7.0 + 7.0/7.0 + (7.0 - (3.0/2.0))/7.0) * (1.0/3.0) = 0.9285714285

The critical number in that expression is the 3.0. This number must represent "the number of matching (but different sequence order) characters" (Wikipedia). To my mind, in S1 and S2 the characters that match but are in a different sequence order are 'e', 'r', 'v'.

However, JellyFish seems to only identify two transpositions as it is calculating:

(7.0/7.0 + 7.0/7.0 + (7.0 - (2.0/2.0))/7.0) * (1.0/3.0) = 0.95238095
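Working through both versions of the arithmetic (assuming 7 matches across two 7-character strings, and 3 matched-but-out-of-order characters, as above):

```python
# Both candidate scores for 'Poverty' vs 'Poervty': m = 7 matches,
# both strings of length 7, and 3 matched-but-out-of-order characters.
m = 7.0
score_half  = (m/7 + m/7 + (m - 3/2.0)/m) / 3   # transpositions = 3/2 = 1.5
score_trunc = (m/7 + m/7 + (m - 3//2)/m) / 3    # transpositions = 3//2 = 1
print(score_half)   # 0.9285714...
print(score_trunc)  # 0.9523809...
```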

Am I wrong on this, or is there something bad in the function?

Woody Pride

1 Answer


If you look at the Jellyfish source code jaro.c, you'll see that the number of transpositions is stored in the variable trans_count, which has type long. This means that when it is divided by two:

trans_count /= 2;

this uses C's integer division, which truncates the result. So in your example (POVERTY/POERVTY) the number of transpositions is 3 but this becomes 1 when divided by 2.
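The effect is easy to reproduce with a pure-Python sketch of the textbook Jaro computation (this is not the Jellyfish C source; the names and structure below are my own). The truncate flag mimics C's integer division in `trans_count /= 2`:

```python
def jaro(s1, s2, truncate=True):
    """Jaro similarity; truncate=True halves the out-of-order count with
    integer division, mimicking the C code's `trans_count /= 2`."""
    len1, len2 = len(s1), len(s2)
    window = max(len1, len2) // 2 - 1
    matched1 = [False] * len1
    matched2 = [False] * len2

    # Count matches: equal characters within the search window.
    m = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # Count matched characters that are out of sequence.
    trans = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                trans += 1
            k += 1

    # Halve the count: integer division truncates, as in the C code.
    t = trans // 2 if truncate else trans / 2.0
    return (m / len1 + m / len2 + (m - t) / m) / 3.0

print(jaro('POVERTY', 'POERVTY', truncate=True))   # 0.9523809... (Jellyfish's answer)
print(jaro('POVERTY', 'POERVTY', truncate=False))  # 0.9285714... (the question's expectation)
```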

Is this right? Well, I tried the following avenues of research:

  1. The Wikipedia article is no help because all the examples have an even number of transpositions. (It gives the Jaro score for MARTHA–MARHTA as 0.944 and the Jaro–Winkler score as 0.961.)

  2. Jaro's 1989 paper is not open access.

  3. Winkler's 1990 paper is ambiguous. All he says is:

    The number of mismatched characters is divided by two to yield the number of transpositions.

    with no indication of whether the division is to be followed by a truncation. Although Winkler gives a number of examples, I find it impossible to reproduce the values using the algorithm he describes in the paper. For example, he gives the J–W score for MARTHA–MARHTA as 0.9667 (see Table 1) and I cannot see how to interpret the text to make this right. So this paper is unhelpful. Perhaps it would be worth writing to Winkler for an explanation?

  4. If you look at the code for the "official string comparator to be used for matching during the 1995 Test Census" (which is based on code written by "Bill Winkler, George McLaughlin and Matt Jaro with modifications by Maureen Lynch") then you'll see that it counts transpositions in the variable N_trans, which has type long, and so truncates the division, agreeing with Jellyfish.

    (This code gives the MARTHA–MARHTA score as 0.9708 due to an additional "long string adjustment".)

So it looks to me as though the behaviour of Jellyfish is at least justifiable on the basis of the historical sources. But it does seem like a mistake because it loses information about the number of transpositions for no principled reason.
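For what it's worth, the Wikipedia MARTHA–MARHTA figures quoted in point 1 can be checked by hand (the out-of-order count is even there, so truncation makes no difference):

```python
# MARTHA vs MARHTA: 6 matches out of 6 characters each, with T and H
# out of order, so 2 mismatched characters and 2 // 2 == 1 transposition.
m, l1, l2 = 6, 6, 6
t = 2 // 2
jaro = (m/l1 + m/l2 + (m - t)/m) / 3
jw = jaro + 3 * 0.1 * (1 - jaro)   # common prefix 'MAR' (length 3), scaling factor 0.1
print(round(jaro, 3), round(jw, 3))  # 0.944 0.961
```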

Gareth Rees
  • Fascinating! I mailed the developer about the Levenshtein distance bug and he got back to me. I mentioned this, so perhaps he will tell me why they took that decision. After I found that problem I just assumed it was a bug. Seems like the Test Census source should be pretty reliable. – Woody Pride Dec 12 '13 at 17:59