I have been trying to use the Sim-metrics library from:

    <dependency>
        <groupId>com.github.mpkorstanje</groupId>
        <artifactId>simmetrics-core</artifactId>
        <version>4.1.0</version>
    </dependency>

So far I am computing Jaro Winkler using:

    StringMetric sm = StringMetrics.jaroWinkler();
    float res = sm.compare("Harry Potter", "Potter Harry");
    System.out.println(res);

0.43055558

and the overlap coefficient using:

    sm = StringMetrics.overlapCoefficient();
    res = sm.compare("The quick brown fox", "The slow brawn fur");
    System.out.println(res);

0.25

but according to https://asecuritysite.com/forensics/simstring

the Jaro-Winkler score should be 0 for this, and the overlap coefficient should be 100. Is this even the correct way to use this library?

What are the proper calls if I want to run both of these metrics to match movies from one list against another list I got from IMDB? I intend to compare the titles from both sets and take the average of the two scores, then do the same for the cast of both sets of movies. Thanks

VSEWHGHP
  • @mpkorstanje I don't know if you are still the main person maintaining this library, but I thought this might be a good way to get in touch, thanks – VSEWHGHP Jan 18 '16 at 05:54
  • Sadly, that trick only works on people who are in the conversation. :) – M.P. Korstanje Jan 19 '16 at 15:22
  • The site you linked above is normalizing the scores so that a perfect match (according to the particular algorithm) is 100. For example, the Levenshtein distance between "cat" and "hat" is 1, not 67. It looks like the site calculated the score using (len-dist)/len. I recommend that you validate the library you're using by calculating the Jaro-Winkler distance yourself or by using another library for comparison (e.g. http://lucene.apache.org/core/5_4_0/suggest/org/apache/lucene/search/spell/JaroWinklerDistance.html) – Paul Jan 19 '16 at 18:01
  • Looks like the website is using a rather old and buggy version of Simmetrics. – M.P. Korstanje Jan 19 '16 at 18:23

1 Answer

You are using the library correctly. You may, however, wish to customize the metric you are using. Filtering short, common words like 'the', 'a', 'and', etc., and using a q-gram tokenizer might be more effective than using the default metrics from StringMetrics, most of which tokenize on whitespace and none of which apply filters or simplifiers.

Beyond that I can't really tell you which combination of metrics, tokenizers, filters and simplifiers will work for your use case. What works best is rather domain specific. You'll have to try a few combinations and see which one gives the best matches.
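To make the idea of filtering plus q-gram tokenizing concrete, here is a minimal plain-Java sketch. It does not use the Simmetrics API at all; the class name `QGramSketch`, the stop-word list and the choice of q = 3 are assumptions made purely for illustration, and the score is a simple overlap coefficient over the resulting sets of q-grams.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative sketch only; not part of Simmetrics. */
class QGramSketch {

    // Hypothetical stop-word list; tune it for your own data.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "and", "of"));

    /** Lower-cases the input and drops stop words, keeping single spaces between words. */
    static String simplify(String s) {
        StringBuilder out = new StringBuilder();
        for (String word : s.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(word);
        }
        return out.toString();
    }

    /** Returns the set of distinct substrings of length q (the whole string if it is shorter). */
    static Set<String> qGrams(String s, int q) {
        Set<String> grams = new HashSet<>();
        if (s.length() < q) {
            if (!s.isEmpty()) {
                grams.add(s);
            }
            return grams;
        }
        for (int i = 0; i + q <= s.length(); i++) {
            grams.add(s.substring(i, i + q));
        }
        return grams;
    }

    /** Overlap coefficient |A ∩ B| / min(|A|, |B|) over q-gram sets; 0 if either set is empty. */
    static double overlap(String x, String y, int q) {
        Set<String> a = qGrams(simplify(x), q);
        Set<String> b = qGrams(simplify(y), q);
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return (double) common.size() / Math.min(a.size(), b.size());
    }

    public static void main(String[] args) {
        // Word order matters much less once the strings are compared as 3-grams.
        System.out.println(overlap("Harry Potter", "Potter Harry", 3)); // 0.7
    }
}
```

With whitespace tokens the reordered title would only match on whole words, while character q-grams also give partial credit to near-identical words such as "brown" and "brawn". Whether that helps for movie titles and cast lists is something you would have to measure on your own data.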


When I use the website you provided to calculate the Cosine Similarity and Overlap Coefficient of "The quick brown fox" and "The slow brawn fur", I get:

String 1: The quick brown fox
String 2: The slow brawn fur

The results are then:
Cosine Similarity   25
Overlap Coefficient 25

When I use Simmetrics, I get the same values on a 0 to 1 scale:

    System.out.println(
      StringMetrics.overlapCoefficient().compare(
        "The quick brown fox", "The slow brawn fur")); // 0.25
    System.out.println(
      StringMetrics.cosineSimilarity().compare(
        "The quick brown fox", "The slow brawn fur")); // 0.25
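The value 0.25 can be checked by hand. The sketch below is plain Java, not Simmetrics code; it assumes the strings are split on whitespace and compared as sets of words, which matches the description of the default metrics above. The class name `WhitespaceTokenSketch` is made up for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: re-implements the two set-based scores by hand. */
class WhitespaceTokenSketch {

    /** Splits on runs of whitespace and collects the distinct tokens. */
    static Set<String> tokens(String s) {
        return new HashSet<>(Arrays.asList(s.trim().split("\\s+")));
    }

    /** Number of tokens the two sets have in common. */
    static int shared(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    /** Overlap coefficient: |A ∩ B| / min(|A|, |B|). */
    static double overlapCoefficient(String x, String y) {
        Set<String> a = tokens(x);
        Set<String> b = tokens(y);
        return (double) shared(a, b) / Math.min(a.size(), b.size());
    }

    /** Cosine similarity over token sets: |A ∩ B| / sqrt(|A| * |B|). */
    static double cosineSimilarity(String x, String y) {
        Set<String> a = tokens(x);
        Set<String> b = tokens(y);
        return shared(a, b) / Math.sqrt((double) a.size() * b.size());
    }

    public static void main(String[] args) {
        String s1 = "The quick brown fox";
        String s2 = "The slow brawn fur";
        // Only "The" is shared, and both strings have four tokens.
        System.out.println(overlapCoefficient(s1, s2)); // 0.25
        System.out.println(cosineSimilarity(s1, s2));   // 0.25
    }
}
```

Both strings contain four words and share exactly one, so both formulas reduce to 1/4. The website reports the same number multiplied by 100.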

Regarding Jaro-Winkler, it looks like the website is using an older version of Simmetrics. The specific combination of metrics and names, in particular Chapman Length Deviation, which was written by the original author of Simmetrics, Sam Chapman, leaves little doubt about it.

The older versions had some peculiarities, though I can't point to the specific one that is causing this difference without debugging the two versions side by side again.

M.P. Korstanje