Let's say I have the following words:

word1 = 'john lennon'
word2 = 'john lenon'
word3 = 'lennon john'

It's fairly clear that these three words refer to the same person. Consider the following code:

> library(stringdist)
> stringdist('john lennon', 'john lenon', method = 'jw')
[1] 0.06363636
> stringdist('john lennon', 'lennon john', method = 'qgram')
[1] 0
> stringdist('john lennon', 'lennon john', method = 'jw')
[1] 0.33
> stringdist('john lennon', 'john lenon', method = 'qgram')
[1] 1

For the reversed-order example, qgram clearly works better, but that is only that one case. My question is: how can I combine these two methods?

jw gives better results in general but can't catch reversed words (in my case, name-surname versus surname-name). Any advice?

Mpizos Dimitris

2 Answers

You could add an `if` statement that runs the jw method if and only if the qgram distance is not equal to 0, i.e. `if (stringdist('john lennon', 'john lenon', method = 'qgram') != 0) { stringdist('john lennon', 'john lenon', method = 'jw') }`
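
A minimal sketch of that idea as a reusable function, assuming the `stringdist` package is installed. The name `combined_dist` is hypothetical and only used for illustration. It returns 0 when the q-gram profiles match and falls back to jw otherwise:

```r
library(stringdist)

# Hypothetical helper (not part of stringdist): return 0 when the
# q-gram distance is 0, otherwise return the Jaro-Winkler distance.
combined_dist <- function(a, b) {
  qg <- stringdist(a, b, method = "qgram")  # default q = 1: compares character counts
  jw <- stringdist(a, b, method = "jw")
  ifelse(qg == 0, 0, jw)                    # vectorised over the inputs
}

combined_dist("john lennon", "lennon john")  # q-gram distance is 0, so the result is 0
combined_dist("john lennon", "john lenon")   # q-gram distance is not 0, so the jw value is returned
```

Note that this only handles reversed names that are otherwise spelled identically; a reversed name that also contains a typo still gets the plain jw distance.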

Sotos
  • Thanks for the answer, but it still doesn't cover all cases. For example, `stringdist('john lennon','lenon john',method = 'qgram')` is not zero using qgram and has a high distance when using `jw`. – Mpizos Dimitris Dec 14 '15 at 10:37
  • So to be clear, what answer would you want for the case: `stringdist('john lennon','lenon john')`? – Sotos Dec 14 '15 at 10:45
  • Considering that `stringdist('john lennon','john lenon',method = 'jw')` gives `0.06`, a value of `0.1` for `stringdist('john lennon','lenon john')` would be fair. – Mpizos Dimitris Dec 14 '15 at 10:55

I had an idea that seems computationally costly, but at least it gives quite good results.

library(stringdist)

word1 = 'john lennon'
word2 = 'john lenon'
word3 = 'lennon john'

First, remove the spaces:

word1b = gsub(' ','',word1)
word2b = gsub(' ','',word2)
word3b = gsub(' ','',word3)

Then sort the characters of each string alphabetically:

word1c = paste(sort(unlist(strsplit(word1b, ""))), collapse = "")
word2c = paste(sort(unlist(strsplit(word2b, ""))), collapse = "")
word3c = paste(sort(unlist(strsplit(word3b, ""))), collapse = "")

Finally, apply the jw method:

stringdist(word1c,word2c,method = 'jw')
[1] 0.03333333
stringdist(word1c,word3c,method = 'jw')
[1] 0
stringdist(word2c,word3c,method = 'jw')
[1] 0.03333333

The results are satisfactory. Drawback: it could give unwanted results for short words, since any two strings made of the same letters (anagrams) get a distance of 0.
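
One possible way to soften that drawback is to sort whole words instead of individual characters, so the letter order inside each word is kept. This is a different technique from the one above (often called a token sort), shown only as a sketch. It assumes the `stringdist` package is installed and that name parts are separated by spaces; `sort_tokens` and `token_sort_jw` are hypothetical names, not part of `stringdist`:

```r
library(stringdist)

# Hypothetical helper: split each string on spaces, sort the words
# alphabetically, and glue them back together with single spaces.
sort_tokens <- function(x) {
  vapply(
    strsplit(x, " +"),
    function(tokens) paste(sort(tokens), collapse = " "),
    character(1)
  )
}

# Hypothetical helper: Jaro-Winkler distance on the word-sorted strings.
token_sort_jw <- function(a, b) {
  stringdist(sort_tokens(a), sort_tokens(b), method = "jw")
}

token_sort_jw("john lennon", "lennon john")  # both normalise to "john lennon", so the result is 0
token_sort_jw("john lennon", "lenon john")   # compares "john lennon" with "john lenon"
```

With this normalisation, 'john lennon' versus 'lenon john' reduces to the same pair that gave about 0.06 with jw in the question, while strings that merely share the same letters are no longer treated as identical.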

Mpizos Dimitris
  • If you're comparing character counts, it is easier to use `stringdist(word1,word2,method="qgram",q=1)`, which is not as costly. –  Mar 05 '16 at 11:12