Is there an industry standard for how large the Jaro-Winkler score should be to say that the two strings are likely similar?

I have a list of strings and I want to see whether any of them are plausible typographical errors for the name James. I am using the Perl module Text::JaroWinkler, which is implemented in C; the strings themselves come from a Stata dataset. (So if there is a Stata module, I'm all ears!)

Here is the Perl code I have written so far to compare each string to James.

    #!/usr/bin/perl

    use 5.10.0;
    use strict;
    use warnings;
    use Text::JaroWinkler qw( strcmp95 );
    use List::Util qw( min max );

    open( my $l, '<', 'Strings.txt' ) or die "Can't open Strings.txt: $!";
    open( my $o, '>', 'JW.txt' )      or die "Can't open JW.txt: $!";

    while ( my $line = <$l> ) {
        chomp($line);

        # Third argument to strcmp95, as I currently choose it (see my question below)
        my $length = min( length($line), length('JAMES') );
        my $jarow  = strcmp95( $line, 'JAMES', $length );

        print "$line,'JAMES',$jarow\n";         # echo to the screen
        print {$o} "$line,'JAMES',$jarow\n";    # and write the same row to JW.txt
    }

    close $l;
    close $o;

I'm also not sure whether I'm using the third parameter of the Jaro-Winkler function correctly or effectively. Perhaps I should be passing length('JAMES')?

paso
  • (I removed the C tag, since this question isn't about C code and the module being written in C is irrelevant unless you're actually trying to modify it) – Wooble Feb 22 '13 at 15:54
  • @user2096518: The answer to that lies in the Perl module and is independent of the XS source. The two strings you pass are pre-processed by changing their lengths to the value of the third parameter. Longer strings are truncated and shorter ones are padded on the right with spaces. You should change your `min` to `max`. – Borodin Feb 22 '13 at 18:00
  • 1
    @user2096518: Why have you asked again about the function of the third parameter? It was answered very well in your previous question [*What is the third parameter to Text::JaroWinkler::strcmp95 for?*](http://stackoverflow.com/questions/15015280/). It is an irritation when people don't bother researching a question before posting it on Stack Overflow, but to ignore an answer to your *own question* is unforgiveable. – Borodin Feb 22 '13 at 20:56
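
To make the pre-processing described in the comment above concrete, here is a minimal Python sketch. It only mimics the truncate-or-pad step as Borodin describes it; `fit_to_length` is a hypothetical helper, not part of Text::JaroWinkler, and the module's real internals may differ.

```python
def fit_to_length(s: str, n: int) -> str:
    """Truncate s to n characters, or pad it on the right with spaces.

    Assumption: this mirrors the behaviour described in the comment,
    not the actual source of the Perl module.
    """
    return s[:n].ljust(n)


target = "JAMES"
for name in ["JAMESON", "JIM"]:
    shortest = min(len(name), len(target))
    longest = max(len(name), len(target))
    # What would be compared if the third argument were min(...) or max(...)
    print("min:", repr(fit_to_length(name, shortest)), repr(fit_to_length(target, shortest)))
    print("max:", repr(fit_to_length(name, longest)), repr(fit_to_length(target, longest)))
```

Under that description, passing the shorter length cuts "JAMESON" down to "JAMES", so the two strings would be compared as if they were identical, whereas passing the longer length keeps every character of both.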

1 Answer

Try the user-written strgroup command from SSC, which matches strings using Levenshtein distance. It comes with another command called levenshtein that you can use for this. Some toy code to give you an idea:

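    * install the user-written package from SSC (needed only once)
    ssc install strgroup

    * toy data
    input str8 names
    Bob
    James
    Jim
    Jameson
    end

    * the string to compare every name against
    gen james = "James"

    * edit distance between each name and "James", stored in LD
    levenshtein names james, gen(LD)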

You can then sort by LD to get an idea of what cut-off might work well in your case.

The other way would be to do this, which creates groups for you:

    strgroup names, gen(group) threshold(0.5)

and play around with the threshold.

I don't think a standard exists and these procedures will still entail lots of manual work.

dimitriy
  • Thank you for this suggestion! I had not known about this, and will definitely consider going with it as I prefer a Stata solution. Do you know how it compares to the Jaro-Winkler distance? That is, the pros and cons between the two? – paso Feb 22 '13 at 17:49
  • I am not sure how they compare in theory or practice. I never really worked with JW. Another approach is to try Google Refine. They offer several string matching algorithms for reconciliation. – dimitriy Feb 22 '13 at 17:56
  • Upon doing further research . . . it looks like Jaro-Winkler is specifically designed to work well for names, whereas Levenshtein is good for general typographical errors in strings. Since I'm dealing with names, I'm not sure I should go the Levenshtein route. In addition, Levenshtein does not give a probability, does it? It just gives the number of characters that must be changed? – paso Feb 22 '13 at 18:17
  • I thought that Jaro-Winkler provides a probability between 0 and 1. Or at least provides a number between 0 and 1 which can be interpreted as a probability, no? – paso Feb 22 '13 at 18:36
  • I meant strgroup and levenshtein. I don't know what JW does, and I don't know that you can always interpret such numbers as probabilities. It's possible to normalize Levenshtein distance by either the longer or the shorter word length. – dimitriy Feb 22 '13 at 18:51
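
To put the two measures discussed in these comments side by side, here is a minimal Python sketch. It assumes the textbook Jaro-Winkler definition (prefix scale 0.1, prefix capped at 4 characters), which is not necessarily what `strcmp95` or strgroup compute, and it normalizes Levenshtein distance by the longer string's length, one of the two options mentioned above. All function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus distance divided by the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters count as matching only within this distance of each other
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order
    out_of_order = 0
    k = 0
    for i in range(len(s1)):
        if used1[i]:
            while not used2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return j + prefix * scale * (1.0 - j)


for name in ["JAMSE", "JAMESON", "JIM", "BOB"]:
    print(name, round(jaro_winkler(name, "JAMES"), 3),
          round(normalized_levenshtein(name, "JAMES"), 3))
```

For a transposition such as JAMSE, this Jaro-Winkler sketch scores about 0.95 while the normalized Levenshtein similarity is 0.6, because plain Levenshtein counts the swap as two edits. Both numbers are similarity scores on a 0 to 1 scale rather than probabilities, so any cut-off still has to be checked against hand-reviewed examples.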