I am trying to use Jellyfish to work with fuzzy strings. I am noticing some strange behaviour of the jaro_distance algorithm.
I previously had issues with the damerau_levenshtein_distance algorithm that turned out to be a bug in the code, which a Stack Overflow user then raised as an issue on GitHub.
I am not sure whether I am thinking about the measure wrongly or whether this is a genuine bug. I have looked at the source code (http://goo.gl/YVMl8k), but I'm not familiar with C, so it's hard for me to tell whether this is an implementation problem or I am just wrong.
Observe the following:
In [1]: import jellyfish as jf
In [2]: S1 = 'Poverty'
In [3]: S2 = 'Poervty'
In [4]: jf.jaro_distance(S1, S2)
Out[4]: 0.95238095
Now, if my understanding of the Jaro distance measure is correct, I believe the result should be 0.9285714285.
I have identified where the calculation diverges. To calculate the measure, I believe the following is correct:
(7.0/7.0 + 7.0/7.0 + (7.0 - 3.0/2.0)/7.0) * (1.0/3.0) = 0.9285714285
The critical number in that expression is the 3.0. This number must represent "the number of matching (but different sequence order) characters" (Wikipedia), which is then halved. To my mind, the characters in S1 and S2 that match but are in a different sequence order are 'e', 'r' and 'v'.
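To double-check my count, here is a minimal pure-Python sketch of the measure as I read the Wikipedia description. The function names `jaro_matches` and `jaro` are my own and are not part of Jellyfish, and the match window assumes the usual floor(max(len)/2) - 1 rule, so treat it as my interpretation rather than a reference implementation.

```python
def jaro_matches(s1, s2):
    """Return (m, k): the number of matching characters, and how many of
    those matches sit in a different order in the two strings."""
    # Assumed window rule: floor(max(len) / 2) - 1, never negative.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used = [False] * len(s2)   # positions in s2 already matched
    matched1 = []              # matched characters in s1 order
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == ch:
                used[j] = True
                matched1.append(ch)
                break
    # The same matched characters, but in s2 order.
    matched2 = [s2[j] for j in range(len(s2)) if used[j]]
    out_of_order = sum(a != b for a, b in zip(matched1, matched2))
    return len(matched1), out_of_order


def jaro(s1, s2):
    m, k = jaro_matches(s1, s2)
    if m == 0:
        return 0.0
    t = k / 2.0  # transpositions = half the out-of-order matches
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3.0


print(jaro_matches("Poverty", "Poervty"))
print(jaro("Poverty", "Poervty"))
```

With this sketch I get 7 matches, 3 of them out of order ('v', 'e', 'r'), and a score of roughly 0.9285714.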
However, Jellyfish seems to count only two such characters, as it is effectively calculating:
(7.0/7.0 + 7.0/7.0 + (7.0 - 2.0/2.0)/7.0) * (1.0/3.0) = 0.95238095
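The two expressions can be checked side by side with a few lines of Python. The helper `jaro_from_counts` is a hypothetical name of my own; it only plugs counts into the formula and does not call Jellyfish.

```python
def jaro_from_counts(len1, len2, m, k):
    """Jaro score from string lengths, match count m, and the number k of
    matched characters that are in a different sequence order."""
    t = k / 2.0  # halve the out-of-order count
    return (m / len1 + m / len2 + (m - t) / m) / 3.0


expected = jaro_from_counts(7, 7, 7, 3)  # my count: 'e', 'r', 'v'
observed = jaro_from_counts(7, 7, 7, 2)  # count that reproduces Out above

print(round(expected, 8))  # 0.92857143
print(round(observed, 8))  # 0.95238095
```

So a count of 2 instead of 3 is enough to reproduce the value Jellyfish returns.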
Am I wrong about this, or is there a bug in the function?