Jaccard similarity in stringdist package to match words in character string

Question

I would like to use the Jaccard method of the stringdist function to measure the similarity of bags of words. From what I can tell, method='jaccard' only compares the letters (q-grams) within each character string.

c <- c('cat', 'dog', 'person')
d <- c('cat', 'dog', 'ufo')

stringdist(c, d, method='jaccard', q=2)
[1] 0 0 1

So we see here that it compares the two vectors element by element: 'cat' with 'cat', 'dog' with 'dog', and 'person' with 'ufo'.

I also tried converting the words into one long text string. The following comes closer to what I need, but it is still calculating 1 - (number of shared 2-grams / number of total unique 2-grams) over characters:

f <- 'cat dog person'
g <- 'cat dog ufo'
stringdist(f, g, method='jaccard', q=2)
[1] 0.5625
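
For reference, the 0.5625 above can be reproduced by hand in base R. This is only a sketch of the formula quoted in the question, not the package's internal code, and `qgrams` is a hypothetical helper written for this illustration:

```r
# Hypothetical helper: unique character q-grams of a single string.
qgrams <- function(s, q = 2) {
  n <- nchar(s)
  if (n < q) return(character(0))
  unique(substring(s, 1:(n - q + 1), q:n))
}

f <- 'cat dog person'
g <- 'cat dog ufo'

a <- qgrams(f)   # 13 unique 2-grams, including ones containing a space
b <- qgrams(g)   # 10 unique 2-grams

shared <- length(intersect(a, b))   # 7
total  <- length(union(a, b))       # 16
char_dist <- 1 - shared / total     # 1 - 7/16 = 0.5625
char_dist
```

The spaces count as ordinary characters here, so the result reflects character overlap across the whole string rather than overlap between words.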

How would I get it to calculate similarity by the words?

Please explain your desired outcome more clearly. The first instance calculates the distance between the words at each position, in order. Are you interested in comparing two bags of words (unordered sets)? – lmo, May 10 '16 at 16:38

Psidom · Answer 1 · 2016-05-10T16:44:36.133

5

You can start by tokenizing each sentence and hashing the resulting list of words, which turns your sentences into lists of integers, and then use seq_dist() to calculate the distance.

library(hashr); library(stringdist)
f <- 'cat dog person'
g <- 'cat dog ufo'
seq_dist(hash(strsplit(f, "\\s+")), hash(strsplit(g, "\\s+")), method = "jaccard", q = 2)
[1] 0.6666667
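
To see where 0.6666667 comes from: with q = 2 the comparison is over pairs of consecutive words, not single words. Below is a base R sketch of that arithmetic, assuming the usual set-based Jaccard definition; `word_qgrams` and `jaccard_dist` are hypothetical helpers for illustration, not functions from hashr or stringdist:

```r
f_words <- strsplit('cat dog person', "\\s+")[[1]]
g_words <- strsplit('cat dog ufo', "\\s+")[[1]]

# Hypothetical helper: unique q-grams of consecutive words, joined by "|".
word_qgrams <- function(w, q) {
  n <- length(w)
  if (n < q) return(character(0))
  unique(vapply(seq_len(n - q + 1),
                function(i) paste(w[i:(i + q - 1)], collapse = "|"),
                character(1)))
}

# Hypothetical helper: Jaccard distance between two sets.
jaccard_dist <- function(a, b) 1 - length(intersect(a, b)) / length(union(a, b))

fa <- word_qgrams(f_words, 2)   # "cat|dog", "dog|person"
ga <- word_qgrams(g_words, 2)   # "cat|dog", "dog|ufo"
d2 <- jaccard_dist(fa, ga)      # 1 shared of 3 distinct pairs: 1 - 1/3
d2
```

Because word pairs depend on word order, this measure changes if the words are rearranged. A plain bag-of-words comparison corresponds to single words (q = 1) instead.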

edited May 10 '16 at 16:44

answered May 10 '16 at 16:38

Psidom


This measure could also be achieved in the OP's first example: `wordSim <- 1 - stringdist(c, d, method='jaccard', q=2); sum(wordSim) / length(wordSim)`. – lmo May 10 '16 at 16:43
I think for the example OP gives, yes. But generally it may not be correct. Consider this example `c <- c('cat', 'dog', 'person'); d <- c('cat', 'dog', 'upon'); stringdist(c, d, method='jaccard', q=2); [1] 0.0000000 0.0000000 0.8571429` – Psidom May 10 '16 at 16:48

These are interesting suggestions. Thank you. I was initially looking at the stringdist package as a faster alternative to just calculating the Jaccard similarity manually: `c <- c('cat', 'dog', 'person'); d <- c('cat', 'dog', 'ufo'); length(intersect(c, d)) / length(union(c, d))`. It looks like this simple method is the best. It's interesting to note that this gives 1 for the highest similarity, whereas the stringdist formula gives 0 for the highest similarity. – matsuo_basho May 11 '16 at 12:44
@Psidom, I was hoping you would be able to share some solution to a similar query I had on usage of stringdistmatrix. I have posted the same as a question at - http://stackoverflow.com/questions/42486172/r-string-match-for-address-using-stringdist-stringdistmatrix Hope you would be able to help!! – user1412 Mar 04 '17 at 13:15
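
The manual calculation from matsuo_basho's comment can be wrapped up as follows. This is a minimal base R sketch; `jaccard_sim` is a hypothetical name, and the variables are renamed to `w1` and `w2` to avoid masking `c()`:

```r
# Hypothetical helper: Jaccard similarity between two bags of words.
# Note: two empty inputs give 0/0 = NaN; handle that case separately if needed.
jaccard_sim <- function(x, y) {
  length(intersect(x, y)) / length(union(x, y))
}

w1 <- c('cat', 'dog', 'person')
w2 <- c('cat', 'dog', 'ufo')

sim <- jaccard_sim(w1, w2)                  # 2 shared / 4 distinct words = 0.5
dst <- 1 - sim                              # distance convention: 0 means identical
sim_reordered <- jaccard_sim(rev(w1), w2)   # word order does not matter

c(similarity = sim, distance = dst)
```

Subtracting the similarity from 1 converts it to the distance convention used by stringdist, where 0 is the closest match.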
