
I'm trying to do approximate string matching between two lists of terms, `terms1` and `terms2`, where I want to match strings despite typos, different notations, etc. I'm using:

amatch(terms1, terms2, method="osa", maxDist=1, nomatch=0)

I want to match e.g. licence and license, but I don't want to match training and raining. So I thought about excluding the first character from the approximate matching, so that it is not considered for deletion or substitution but has to be the same in both strings. How could this be done, or is there a better way to match correctly?
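For illustration, the first-character constraint described above could be sketched roughly like this. It assumes the `stringdist` package (which provides `amatch`) and no `NA` values in the input; `amatch_same_first` is a hypothetical helper name, not part of the package. It computes the full distance matrix and then rules out every candidate whose first character differs.

```r
library(stringdist)

# Hypothetical helper (not part of stringdist): behaves like amatch(), but a
# candidate in `table` is only eligible if it starts with the same character
# as the term being matched. Assumes no NA values in x or table.
amatch_same_first <- function(x, table, method = "osa", maxDist = 1, nomatch = 0L) {
  # Pairwise distances: one row per element of x, one column per element of table
  d <- stringdistmatrix(x, table, method = method)
  # TRUE where the first characters agree
  same_first <- outer(substr(x, 1, 1), substr(table, 1, 1), "==")
  # Rule out candidates with a different first character or too large a distance
  d[!same_first | d > maxDist] <- Inf
  idx <- apply(d, 1, function(row) {
    if (all(is.infinite(row))) nomatch else which.min(row)
  })
  as.integer(idx)
}

terms1 <- c("licence", "training", "tool")
terms2 <- c("license", "raining", "cool")

amatch_same_first(terms1, terms2)
# 1 0 0  (only licence/license is matched)
```

Building the full matrix is simple but costs memory proportional to `length(terms1) * length(terms2)`, so for long lists it may be better to split the terms by first letter first.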

Any help appreciated!

Alec
  • I am not sure about your idea to exclude the first character: would you be ok matching terms like "rain" and "rainy"? – Ale Aug 31 '17 at 09:05
  • Yes, that would be OK. Matching only via stemming and exact string matching won't be enough, as it would not match licenc/licens or typos. But with approximate matching I get a lot of wrong matches, e.g. "tool" and "cool" or "training" and "raining". I should rather say that the first character has to be the same in both strings, so it shouldn't be considered for deletion/transposition in the approximate matching. – Alec Aug 31 '17 at 09:09
  • An inelegant but working solution could be to loop through the built-in `letters` variable and subset your `terms1` and `terms2` by their starting letter, which guarantees that the first character matches. – Michael Bird Aug 31 '17 at 09:14
  • Have a look at the `stringdist` package and try some of the different methods for calculating distances. I have had good results with the Jaro-Winkler distance (with p=0.15 or so), which gives more weight to matching first characters than the Levenshtein distance. – Andrew Gustar Aug 31 '17 at 09:25
  • Thanks @AndrewGustar, that definitely helped me! – Alec Aug 31 '17 at 12:40
  • I'm using Jaro-Winkler now with p=0.1. It matches licenc/licens now, but also licens/lins, which I don't want to be matched. Is there any option to set maxDist or something else so that the second example isn't matched anymore? – Alec Aug 31 '17 at 12:56
  • I'm afraid you will never get 100% success with this sort of thing. It is usually a case of trial and error to find the best parameters and cut-off point, and you have to be prepared for a certain amount of manual checking and/or some remaining errors. – Andrew Gustar Aug 31 '17 at 21:13
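
As a rough sketch of the trial-and-error approach discussed in the comments, the Jaro-Winkler distances for the problem pairs can be inspected first and a cut-off chosen from them. The example assumes the `stringdist` package; `p = 0.1` comes from the comment above, while the `cutoff` value is a made-up starting point that has to be tuned on real data.

```r
library(stringdist)

terms1 <- c("licenc", "lins")
terms2 <- c("licens")

# Jaro-Winkler distance lies in [0, 1]; 0 means the strings are identical.
# p is the prefix weight, so a shared prefix lowers the distance.
jw_dist <- stringdist(terms1, terms2, method = "jw", p = 0.1)
print(data.frame(term = terms1, candidate = terms2, jw = jw_dist))

# Hypothetical cut-off: pick it after looking at the distances printed above,
# so that wanted pairs fall below it and unwanted pairs above it.
cutoff <- 0.08

# For method = "jw", maxDist is interpreted on the same [0, 1] scale
matched <- amatch(terms1, terms2, method = "jw", p = 0.1,
                  maxDist = cutoff, nomatch = 0)
print(matched)
```

No single cut-off will separate all wanted and unwanted pairs, so some manual checking of the borderline cases remains necessary.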

0 Answers