I would like to use the Jaccard similarity in the stringdist function to determine the similarity of bags of words. From what I can tell, using Jaccard only matches by letters within a character string.

library(stringdist)
c <- c('cat', 'dog', 'person')
d <- c('cat', 'dog', 'ufo')

stringdist(c, d, method='jaccard', q=2)
[1] 0 0 1

So we see here that it compares the two vectors element by element and returns the distance for each pair: 'cat' and 'cat', 'dog' and 'dog', and 'person' and 'ufo'.

I also tried converting the words into 1 long text string. The following approaches what I need, but it's still calculating 1 - (number of shared 2-grams / number of total unique 2-grams):

f <- 'cat dog person'
g <- 'cat dog ufo'
stringdist(f, g, method='jaccard', q=2)
[1] 0.5625
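
To make the arithmetic behind that number concrete, here is a minimal base R sketch that rebuilds the character 2-gram sets by hand. The helper name `char_qgrams` is hypothetical (it is not part of stringdist), and the sketch assumes the distance is computed over unique 2-grams, as described above.

```r
# Hypothetical helper: the unique character q-grams of a single string.
char_qgrams <- function(s, q = 2) {
  n <- nchar(s)
  if (n < q) return(character(0))
  starts <- seq_len(n - q + 1)
  unique(substring(s, starts, starts + q - 1))
}

f <- 'cat dog person'
g <- 'cat dog ufo'

a <- char_qgrams(f)  # 13 unique 2-grams, including ones containing a space
b <- char_qgrams(g)  # 10 unique 2-grams

shared <- length(intersect(a, b))  # 7: "ca" "at" "t " " d" "do" "og" "g "
total  <- length(union(a, b))      # 16 = 13 + 10 - 7

jaccard_dist <- 1 - shared / total
jaccard_dist
# [1] 0.5625
```

The spaces count as ordinary characters here, which is why the result reflects letter overlap rather than word overlap.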

How would I get it to calculate similarity by the words?

matsuo_basho
  • Please explain your desired outcome better. The first instance calculates the difference between each pair of words in order. Are you interested in comparing two bags of words (unordered sets)? – lmo May 10 '16 at 16:38

1 Answer

You can start by tokenizing each sentence and hashing the resulting list of words, which turns each sentence into a list of integers. Then use seq_dist() to calculate the distance between those integer sequences.

library(hashr); library(stringdist)
f <- 'cat dog person'
g <- 'cat dog ufo'
seq_dist(hash(strsplit(f, "\\s+")), hash(strsplit(g, "\\s+")), method = "jaccard", q = 2)
[1] 0.6666667
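
Note that with q = 2 the comparison is over pairs of adjacent words rather than single words: the sentences share one word bigram out of three, hence 1 - 1/3. The base R sketch below reproduces that figure and also shows the single-word (bag-of-words) case. The helper name `word_jaccard_dist` and its argument `n` are hypothetical, and the sketch assumes words are separated by whitespace with no leading or trailing spaces.

```r
# Hypothetical helper: Jaccard distance between the sets of word n-grams
# of two strings. Returns NaN if neither string yields any n-gram.
word_jaccard_dist <- function(x, y, n = 1) {
  word_ngrams <- function(s) {
    w <- strsplit(s, "\\s+")[[1]]
    if (length(w) < n) return(character(0))
    starts <- seq_len(length(w) - n + 1)
    unique(vapply(starts,
                  function(i) paste(w[i:(i + n - 1)], collapse = " "),
                  character(1)))
  }
  a <- word_ngrams(x)
  b <- word_ngrams(y)
  1 - length(intersect(a, b)) / length(union(a, b))
}

f <- 'cat dog person'
g <- 'cat dog ufo'

bigram_dist <- word_jaccard_dist(f, g, n = 2)  # 1 - 1/3: only "cat dog" is shared
bag_dist    <- word_jaccard_dist(f, g, n = 1)  # 1 - 2/4: "cat" and "dog" out of 4 words
bigram_dist
# [1] 0.6666667
bag_dist
# [1] 0.5
```

If an order-independent bag-of-words distance is the goal, the single-word case (n = 1 in this sketch) is the one to use; passing q = 1 to seq_dist() should presumably correspond to it, though that is not verified here.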
Psidom
  • This measure could also be achieved in the OP's first example: `wordSim <- 1 - stringdist(c, d, method='jaccard', q=2); sum(wordSim) / length(wordSim)`. – lmo May 10 '16 at 16:43
  • I think for the example OP gives, yes. But generally it may not be correct. Consider this example `c <- c('cat', 'dog', 'person'); d <- c('cat', 'dog', 'upon'); stringdist(c, d, method='jaccard', q=2); [1] 0.0000000 0.0000000 0.8571429` – Psidom May 10 '16 at 16:48
  • These are interesting suggestions. Thank you. I was initially looking at the stringdist package as a faster alternative to just calculating the Jaccard similarity manually: `c <- c('cat', 'dog', 'person'); d <- c('cat', 'dog', 'ufo'); length(intersect(c, d)) / length(union(c, d))`. It looks like this simple method is the best. It's interesting to note that this gives 1 for the highest similarity, whereas the stringdist formula gives 0 for the highest similarity. – matsuo_basho May 11 '16 at 12:44
  • @Psidom, I was hoping you could share a solution to a similar query I have about using stringdistmatrix. I have posted it as a question at http://stackoverflow.com/questions/42486172/r-string-match-for-address-using-stringdist-stringdistmatrix. I hope you are able to help! – user1412 Mar 04 '17 at 13:15