We have been using the Jaro-Winkler fuzzy matching implementation from Apache Commons Text, and whilst studying the code we found a potential flaw.
It seems that this implementation is based on the very comprehensible Wikipedia article about Jaro-Winkler:
https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
Jaro-Winkler uses a formula to calculate the proximity of two strings. The output is typically a double between 0.0 and 1.0. The formula takes the number of matches, the number of transpositions and the common prefix length as input.
Whilst studying the Apache Commons Jaro-Winkler implementation (see https://commons.apache.org/sandbox/commons-text/jacoco/org.apache.commons.text.similarity/JaroWinklerDistance.java.html) we saw this code for extracting the prefix length:
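To make the inputs concrete, here is a minimal sketch of the formula as we read it in the Wikipedia article. It is not the Apache Commons Text code: the class and method names are made up for illustration, and the scaling factor of 0.1 is the commonly quoted default rather than anything taken from the library.

```java
/** Illustrative sketch of the Jaro-Winkler formula; names are hypothetical. */
final class JaroWinklerFormula {

    // Scaling factor p; 0.1 is the commonly quoted default (an assumption here).
    static final double SCALING_FACTOR = 0.1;

    // Jaro similarity from m matches and t transpositions, where t is half the
    // number of matched characters that appear in a different order.
    static double jaro(int m, int t, int len1, int len2) {
        if (m == 0) {
            return 0.0;
        }
        return ((double) m / len1 + (double) m / len2 + (double) (m - t) / m) / 3.0;
    }

    // Winkler boost: sim_w = sim_j + l * p * (1 - sim_j), with l capped at 4.
    static double jaroWinkler(int m, int t, int prefix, int len1, int len2) {
        double simJ = jaro(m, t, len1, len2);
        int l = Math.min(prefix, 4);
        return simJ + l * SCALING_FACTOR * (1.0 - simJ);
    }

    public static void main(String[] args) {
        // "MARTHA" vs "MARHTA": 6 matches, 1 transposition, common prefix "MAR".
        System.out.println(jaroWinkler(6, 1, 3, 6, 6)); // roughly 0.961
    }
}
```

With p = 0.1 and l capped at 4, the boost can close at most 40% of the gap between the Jaro score and 1.0, so the result stays below 1.0 for strings that are not identical.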
    int prefix = 0;
    for (int mi = 0; mi < min.length(); mi++) {
        if (first.charAt(mi) == second.charAt(mi)) {
            prefix++;
        } else {
            break;
        }
    }
The loop itself looks correct, but it counts the entire common prefix, which does not match the definition of the prefix length in the Wikipedia article:
"l is the length of common prefix at the start of the string up to a maximum of four characters"
As I understand it, the prefix length should therefore never exceed 4.
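For illustration, this is roughly what we would expect a capped version of the loop to look like. It is only a sketch under our reading of the article; the class name, method name and MAX_PREFIX constant are hypothetical and not part of Apache Commons Text.

```java
/** Illustrative sketch of a capped common-prefix count; names are hypothetical. */
final class CappedPrefix {

    // Cap taken from the Wikipedia definition of l.
    static final int MAX_PREFIX = 4;

    // Counts leading characters shared by both inputs, stopping at maxPrefix.
    static int commonPrefixLength(CharSequence first, CharSequence second, int maxPrefix) {
        int limit = Math.min(maxPrefix, Math.min(first.length(), second.length()));
        int prefix = 0;
        while (prefix < limit && first.charAt(prefix) == second.charAt(prefix)) {
            prefix++;
        }
        return prefix;
    }

    public static void main(String[] args) {
        System.out.println(commonPrefixLength("prefixes", "prefixed", MAX_PREFIX));        // 4
        System.out.println(commonPrefixLength("prefixes", "prefixed", Integer.MAX_VALUE)); // 7
        System.out.println(commonPrefixLength("MARTHA", "MARHTA", MAX_PREFIX));            // 3
    }
}
```

Passing the cap as a parameter keeps the uncapped behaviour available for comparison, which makes the difference between the two readings easy to see.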
Because the prefix is not capped, the Apache Commons Text implementation gives a disproportionate boost to strings with long common prefixes. For example, comparing
"john.fernandez@onepointltd.com" - "john.fernandez@onepointlt.co"
with the Apache Commons Text implementation returns 1.0, which indicates a full match even though the two strings differ. This does not feel right.
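To see how much the cap matters for this pair, here is a small self-contained sketch that computes a textbook Jaro similarity and then applies the Winkler boost with and without the cap. It follows the Wikipedia formula with an assumed scaling factor of 0.1; it is not the Apache Commons Text code, which may scale or round the result differently, and all names are hypothetical.

```java
/** Illustrative comparison of capped and uncapped prefix boosts; not library code. */
final class PrefixBoostDemo {

    // Textbook Jaro similarity: find matches within a window, then count transpositions.
    static double jaro(String s1, String s2) {
        if (s1.isEmpty() && s2.isEmpty()) {
            return 1.0;
        }
        int window = Math.max(0, Math.max(s1.length(), s2.length()) / 2 - 1);
        boolean[] matched1 = new boolean[s1.length()];
        boolean[] matched2 = new boolean[s2.length()];
        int matches = 0;
        for (int i = 0; i < s1.length(); i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(s2.length() - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }
        // Walk both match sequences in order; mismatching pairs are out of order.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < s1.length(); i++) {
            if (!matched1[i]) {
                continue;
            }
            while (!matched2[k]) {
                k++;
            }
            if (s1.charAt(i) != s2.charAt(k)) {
                outOfOrder++;
            }
            k++;
        }
        double m = matches;
        return (m / s1.length() + m / s2.length() + (m - outOfOrder / 2) / m) / 3.0;
    }

    // Length of the shared leading run, with no cap applied.
    static int commonPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    // Winkler boost with an assumed scaling factor of 0.1.
    static double boost(double simJ, int prefix) {
        return simJ + prefix * 0.1 * (1.0 - simJ);
    }

    public static void main(String[] args) {
        String a = "john.fernandez@onepointltd.com";
        String b = "john.fernandez@onepointlt.co";
        double simJ = jaro(a, b);
        int prefix = commonPrefix(a, b);                      // 25 shared leading characters
        System.out.println(simJ);                             // roughly 0.978
        System.out.println(boost(simJ, prefix));              // exceeds 1.0 without the cap
        System.out.println(boost(simJ, Math.min(prefix, 4))); // roughly 0.987 with the cap
    }
}
```

In this sketch the uncapped boost pushes the raw value past 1.0, whereas the capped version stays close to the plain Jaro score and still reports the strings as different.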
My question to the community is: should the Apache implementation not limit the prefix length to at most 4?