I am interested in the Jaro-Winkler module written in Perl to compute the distance (or similarity) between two strings:

http://search.cpan.org/~scw/Text-JaroWinkler-0.1/JaroWinkler.pm

The syntax of the function is not clear to me; I could not find any clear documentation of it.

Here is the sample code:

#!/usr/bin/perl

use 5.10.0;
use Text::JaroWinkler qw( strcmp95 );
print strcmp95("it is a dog","i am a dog.",11);

What exactly does the 11 represent? I gather it is a length, but which length? Is it the number of characters I want checked? Is it a required argument?

paso
  • I have actually been using that module recently. I don't know exactly what the 11 does. What I have learned is that I get the best results when I set it to the maximum length of the two strings. – Alex Feb 22 '13 at 01:42
  • Thank you @Alex ! What do you mean by "best results?" – paso Feb 22 '13 at 02:03
  • I don't remember exactly what happened when it was not the maximum, I have all of it set up on my work computer so I am unable to check at the moment. I think it simply returned inaccurate results (either just 0 or 1). So the argument may be telling it how many letters to match. In their example, both strings are the exact same length, which is nice for an example, but not too nice for any real-world application. If I were to venture a guess, I would say that it means to "match at most this many characters", but that's just a guess. – Alex Feb 22 '13 at 02:15
  • Thank you Alex. I would greatly appreciate if you could check when you are at work next. My hunch is that the third parameter sets the length of the comparison such that the length of the comparison made is min(length(first term in function), length(second term in function), third term specified). – paso Feb 22 '13 at 13:48
  • Alex: Is there some sort of industry standard that gives a sense of how large a Jaro-Winkler score should be to say that the two strings are likely similar? – paso Feb 22 '13 at 15:45
  • I don't know about any standard, I think it all changes depending on what you need. For example, if you're comparing small strings, you would probably want a higher score than if you're comparing long strings. It also changes depending on how many mismatches you want to allow. For what I needed, through a few hundred tests of strings ranging from about 3-30 characters, I found a 0.94 to be optimal. However, that is really not a standard of any sort. – Alex Feb 24 '13 at 08:20

1 Answer

See the source for an answer to your question. It contains this line:

$ying = sprintf("%*.*s", -$y_length, $y_length, $ying);

So $y_length is being used to reformat the strings, padding them if necessary and trimming them to an identical length. These equal-length strings are then fed into the actual comparison function. This suggests that Alex is correct and giving a length of max(length $ying, length $yang) is going to give the best results under most circumstances.
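To see what that format string does on its own, here is a minimal sketch using only core Perl. The helper name `pad_or_trim` is made up for illustration; only the `sprintf` format is taken from the line quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the format string quoted from the module source.
# A negative width left-justifies and pads with spaces; the precision truncates.
sub pad_or_trim {
    my ($str, $len) = @_;
    return sprintf("%*.*s", -$len, $len, $str);
}

# Show the effect of a length that is shorter than, equal to,
# and longer than the two 11-character sample strings.
for my $len (5, 11, 15) {
    printf "%2d: [%s] [%s]\n", $len,
        pad_or_trim("it is a dog", $len),
        pad_or_trim("i am a dog.", $len);
}
```

With a length of 5 both strings are cut down to their first five characters, with 11 they pass through unchanged, and with 15 each gets four trailing spaces.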

Reading the source also reveals that if you fail to supply $y_length, no default is supplied. So you'll be comparing the empty string to the empty string. Those should have a pretty short JW distance.
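Since the length has to be passed explicitly, one option is to compute it from the inputs. The sketch below assumes the "maximum of the two lengths" rule suggested above; `default_length` is a hypothetical helper, not part of Text::JaroWinkler, and the `strcmp95` call is left as a comment so the snippet runs without the module installed.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: use the longer string's length as the third argument,
# so that neither string is truncated before the comparison.
sub default_length {
    my ($ying, $yang) = @_;
    return max(length $ying, length $yang);
}

my ($ying, $yang) = ("it is a dog", "i am a dog");
my $len = default_length($ying, $yang);
print "length argument: $len\n";

# With the module installed, the call would look like this
# (same calling convention as the sample code in the question):
#   use Text::JaroWinkler qw( strcmp95 );
#   print strcmp95($ying, $yang, $len), "\n";
```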

darch
  • Thank you for finding that and for posting! Are you saying that the "11" in the sample code meant to make the comparison up to 11 characters? If the number were larger than the lengths of both of the strings fed to the function, what would be "padded"? – paso Feb 22 '13 at 12:01
  • Yes, the number specifies how much of the strings to compare. If it were larger than the length of the original strings, those strings would be padded to the specified length with spaces. See `perldoc -f sprintf` for precise details about how the `sprintf` argument works in the general case. – darch Feb 22 '13 at 18:18
  • Thank you! Do you know how the "spaces" affect the score? Would the result be different if the padding were, say, randomly inserted X's or P's instead? – paso Feb 22 '13 at 18:37
  • @user2096518 I have no actual idea, but surmise from the fact that JW is an edit distance that what character is used to pad should have no effect on the value of the function. But for the real answer, test it and see. – darch Feb 22 '13 at 18:41