Is there an industry-standard threshold for the Jaro-Winkler score above which two strings can be considered likely similar?
I have a list of strings and I want to see if any of them are plausible typographical errors for the name James. I am using the Perl module Text::JaroWinkler, which is implemented in C, and the strings come from a Stata dataset. (So if there were a Stata module, I'd be all ears!)
Here is the Perl code I have written so far to compare each string against James.
#!/usr/bin/perl
use strict;
use warnings;
use 5.10.0;
use Text::JaroWinkler qw( strcmp95 );
use List::Util qw(min max);

my $target = 'JAMES';    # name to compare every input line against

open( my $l, '<', 'Strings.txt' ) or die "Can't open Strings.txt: $!";
open( my $o, '>', 'JW.txt' )      or die "Can't open JW.txt: $!";
while ( my $line = <$l> ) {
    chomp($line);
    # Third argument: currently the length of the shorter of the two strings
    my $length = min( length($line), length($target) );
    my $jarow  = strcmp95( $line, $target, $length );
    print "$line,'$target',$jarow\n";         # echo to the terminal
    print {$o} "$line,'$target',$jarow\n";    # same record to JW.txt
}
close $l;
close $o;
I'm also not sure whether I'm using the third parameter of strcmp95 correctly. Perhaps I should be passing length('JAMES') instead of the length of the shorter string?
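To have something to cross-check the module's scores against, here is a minimal pure-Perl sketch of the textbook Jaro-Winkler formula. It takes no length argument at all, so it sidesteps the third-parameter question. Everything in it is an assumption for illustration: the function names jaro and jaro_winkler are made up, the prefix weight of 0.1 and prefix cap of 4 are the commonly quoted defaults, and the 0.9 cutoff is an arbitrary placeholder, not a standard. It is also not the strcmp95 variant, which applies further adjustments, so the two may not agree exactly.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min max);

# Textbook Jaro similarity (illustrative; NOT the strcmp95 variant).
sub jaro {
    my ( $s, $t ) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my ( $ls, $lt ) = ( scalar @s, scalar @t );
    return 1 if !$ls && !$lt;
    return 0 if !$ls || !$lt;

    # Two equal characters count as a match only within this distance.
    my $window = max( 0, int( max( $ls, $lt ) / 2 ) - 1 );
    my ( @s_hit, @t_hit );
    my $matches = 0;
    for my $i ( 0 .. $#s ) {
        my $lo = max( 0, $i - $window );
        my $hi = min( $#t, $i + $window );
        for my $j ( $lo .. $hi ) {
            next if $t_hit[$j] || $s[$i] ne $t[$j];
            $s_hit[$i] = $t_hit[$j] = 1;
            $matches++;
            last;
        }
    }
    return 0 unless $matches;

    # Matched characters that line up out of order; two make one transposition.
    my @sm = map { $s[$_] } grep { $s_hit[$_] } 0 .. $#s;
    my @tm = map { $t[$_] } grep { $t_hit[$_] } 0 .. $#t;
    my $out_of_order   = grep { $sm[$_] ne $tm[$_] } 0 .. $#sm;
    my $transpositions = $out_of_order / 2;

    return ( $matches / $ls
           + $matches / $lt
           + ( $matches - $transpositions ) / $matches ) / 3;
}

# Winkler adjustment: boost the Jaro score for a shared prefix (up to 4 chars).
sub jaro_winkler {
    my ( $s, $t, $p ) = @_;
    $p = 0.1 unless defined $p;    # assumed default prefix weight
    my $j      = jaro( $s, $t );
    my $prefix = 0;
    for my $i ( 0 .. min( 3, length($s) - 1, length($t) - 1 ) ) {
        last if substr( $s, $i, 1 ) ne substr( $t, $i, 1 );
        $prefix++;
    }
    return $j + $prefix * $p * ( 1 - $j );
}

my $threshold = 0.9;    # hypothetical cutoff, chosen only for this demo
for my $candidate (qw( James Jamse Jams Robert )) {
    # Upper-case both sides so the comparison is case-insensitive.
    my $score = jaro_winkler( uc $candidate, 'JAMES' );
    printf "%s,%.4f,%s\n", $candidate, $score,
        $score >= $threshold ? 'possible typo' : 'probably different';
}
```

With this sketch, the transposition Jamse and the deletion Jams both score about 0.953 against JAMES, while Robert scores about 0.456, which at least suggests what range a workable cutoff for this list would have to fall in.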