
I was trying to complete an NLP assignment using the Jaccard Distance metric function jaccard_distance() built into nltk.metrics.distance, when I noticed that the results it returned did not make sense in the context I expected.

When I examined the implementation of jaccard_distance() in the online source code, I noticed that it was not consistent with the mathematical definition of the Jaccard index.

Specifically, the implementation in nltk is:

return (len(label1.union(label2)) - len(label1.intersection(label2)))/len(label1.union(label2))

but according to the definition, the numerator term should only involve an intersection of the two sets, which means the correct implementation should be:

return len(label1.intersection(label2))/len(label1.union(label2))

When I wrote my own function using the latter, I indeed obtained correct answers to my assignment. For example, I was tasked to recommend a correct spelling suggestion for the misspelled word cormulent, from a comprehensive corpus of words (built into nltk), using Jaccard Distance on trigrams of the words.

When I used the jaccard_distance() from nltk, I instead obtained so many perfect matches (the result from the distance function was 1.0) that were just nowhere near correct.

When I used my own function with the latter implementation, I was able to get a spelling recommendation of corpulent, at a Jaccard Distance of 0.4 from cormulent, which is a decent recommendation.
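
For concreteness, here is a minimal sketch of the kind of comparison I mean, using character trigrams from nltk.util.ngrams (the helper name my_jaccard is just for illustration, not my exact assignment code):

from nltk.util import ngrams

def my_jaccard(label1, label2):
    # intersection over union, i.e. the Jaccard index / similarity
    return len(label1.intersection(label2)) / len(label1.union(label2))

# character trigrams of the misspelled word and of one candidate
t1 = set(ngrams('cormulent', 3))
t2 = set(ngrams('corpulent', 3))

print(my_jaccard(t1, t2))  # 0.4 -- 4 shared trigrams out of 10 in the union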

Could there be a bug with jaccard_distance() in nltk?

AKKA
    Thank you everyone for your help! I also understand now the numerator term in the `nltk` implementation of `jaccard_distance()`, which arises from performing the 1 - Jaccard Similarity. – AKKA Mar 11 '18 at 09:32

2 Answers


The two formulae you quote do not do the exact same thing, but they are mathematically related. The first definition you quote from the NLTK package is called the Jaccard Distance (DJaccard). The second one you quote is called the Jaccard Similarity (SimJaccard).

Mathematically, DJaccard = 1 - SimJaccard. The intuition here is that the more similar two sets are (the higher the SimJaccard), the lower the distance (and hence the lower the DJaccard).
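
For example, with the trigram sets from the question (a quick illustrative check in plain Python, not NLTK code):

A = {'cor', 'orm', 'rmu', 'mul', 'ule', 'len', 'ent'}   # trigrams of "cormulent"
B = {'cor', 'orp', 'rpu', 'pul', 'ule', 'len', 'ent'}   # trigrams of "corpulent"

sim = len(A & B) / len(A | B)                    # Jaccard similarity: 4/10 = 0.4
dist = (len(A | B) - len(A & B)) / len(A | B)    # Jaccard distance:   6/10 = 0.6

print(sim, dist, sim + dist)                     # 0.4 0.6 1.0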

cs95

Are you sure you are not confusing Jaccard's index with Jaccard's distance?

The first indeed should be calculated as you suggest, whereas the second is 1 - Jaccard_index(A, B), which is exactly what the NLTK implementation computes.
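
A quick sanity check with the NLTK function itself (assuming a standard NLTK install) confirms that it behaves like a distance: identical sets are 0 apart, disjoint sets are 1 apart.

from nltk.metrics.distance import jaccard_distance

print(jaccard_distance({1, 2, 3}, {1, 2, 3}))  # 0.0 -> identical sets, zero distance
print(jaccard_distance({1, 2, 3}, {4, 5, 6}))  # 1.0 -> disjoint sets, maximal distance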

The NLTK implementation can also be made faster (0.83 s vs. 1.29 s, roughly 35% faster) with the following change:

def jaccard_distance(label1, label2):
    len_union = len(label1.union(label2))
    return (len_union - len(label1.intersection(label2)))/len_union

You can repeat my test in the following way (the structure of the sets will change the timing - this is only an example):

from timeit import timeit

a = {1,4,6,7,5,7,9,234}
b = {1,43,66,7,85,7,89,234}

def jaccard_distance(label1, label2):
    len_union = len(label1.union(label2))
    return (len_union - len(label1.intersection(label2))) / len_union

def jaccard_distance2(label1, label2):
    return (len(label1.union(label2)) - len(label1.intersection(label2))) / len(label1.union(label2))


s1 = """a = {1,4,6,7,5,7,9,234}
b = {1,43,66,7,85,7,89,234}
def jaccard_distance(label1, label2):
     len_union = len(label1.union(label2))
     return (len_union - len(label1.intersection(label2))) / len_union
for i in range(100000):
     jaccard_distance(a,b)"""

s2 = """a = {1,4,6,7,5,7,9,234}
b = {1,43,66,7,85,7,89,234}
def jaccard_distance2(label1, label2):
     return (len(label1.union(label2)) - len(label1.intersection(label2))) / len(label1.union(label2))
for i in range(100000):
     jaccard_distance2(a,b)"""

print(timeit(stmt=s1, number=10))
print(timeit(stmt=s2, number=10))
sophros