
I am trying to compare two lists of company names using the Levenshtein fuzzy string matching algorithm. I already implemented my own version in Python, but despite using numpy and other tricks to speed it up, Python is slow: on my data set it takes anywhere between 1 and 3 hours depending on which variation I run. So I turned to R to try to speed things up. Unfortunately, I have never used R before.

I found `stringdist` for R, but the issue is that I need to control the cost of each edit operation to make my results more accurate, and I couldn't figure out a way to do this with a premade function. So I tried to write my own. Currently I have a working (weighted) R Levenshtein function, but it's basically a line-for-line translation from Python, so it's just as slow, if not slower.
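To give a sense of what I mean, here is a stripped-down sketch of that kind of loop-based function (not my exact code; the linear position weighting and the name `weighted_lev` are just for illustration):

```r
# Illustrative only: a plain double-loop weighted Levenshtein in R.
# Edits near the start of the strings cost more than edits near the end;
# the linear weighting scheme is just an example.
weighted_lev <- function(a, b, start_weight = 2, end_weight = 1) {
  a_chars <- strsplit(a, "")[[1]]
  b_chars <- strsplit(b, "")[[1]]
  n <- length(a_chars)
  m <- length(b_chars)
  len <- max(n, m)

  # weight of an edit at position `pos` (1 = start of the word)
  w <- function(pos) {
    start_weight - (start_weight - end_weight) * (pos - 1) / max(len - 1, 1)
  }

  d <- matrix(0, nrow = n + 1, ncol = m + 1)
  for (i in seq_len(n)) d[i + 1, 1] <- d[i, 1] + w(i)   # deletions only
  for (j in seq_len(m)) d[1, j + 1] <- d[1, j] + w(j)   # insertions only

  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      sub_cost <- if (a_chars[i] == b_chars[j]) 0 else w(max(i, j))
      d[i + 1, j + 1] <- min(d[i,     j + 1] + w(i),   # delete a[i]
                             d[i + 1, j    ] + w(j),   # insert b[j]
                             d[i,     j    ] + sub_cost)  # substitute
    }
  }
  d[n + 1, m + 1]
}

weighted_lev("acme corp", "acme corporation")
```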

How can I vectorize the Levenshtein algorithm so that it runs quickly in R? I've tried using sapply to replace my loops through the lists, but if anything, that runs even slower. I think the actual guts of the Levenshtein function need to be vectorized but I have no idea how to go about that.
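For completeness, the list-level comparison I swapped in for my outer loops looks roughly like this (names are placeholders; `weighted_lev` is the sketch above):

```r
# Illustrative: score every pair of names across the two lists.
# sapply only restructures the looping; each pair still runs the full
# R-level function, so it is no faster than explicit for loops.
list_a <- c("acme corp", "globex corporation")
list_b <- c("acme corporation", "globex inc")

score_matrix <- sapply(list_a, function(a)
  sapply(list_b, function(b) weighted_lev(a, b)))
```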

Is there a resource I've missed or a way for me to implement this myself? I'm at a loss, and R has been the opposite of intuitive for me so far.

ctrl-z pls
  • Both `stringdist` and the base-R function `adist` implement the Levenshtein distance reasonably quickly, and both allow you to set custom costs/weights for each action. – Andrew Gustar May 12 '17 at 09:43
  • But the issue is my current implementation uses changing costs depending on which index in the word is being computed. This accomplishes weighting the beginning of the words more than the end... I haven't been able to make this work so far without writing it myself. – ctrl-z pls May 12 '17 at 11:00
  • Can you post your R code? – Stewart Macdonald May 12 '17 at 11:19
  • 1
    There are other methods in the `stringdist` package that weight the beginning of words more than the rest - perhaps you could use one of them rather than Levenshtein? I have had good results with method `jw`. – Andrew Gustar May 12 '17 at 11:24
  • I'm looking more closely at the options/functions in `stringdist` now, thank you. Also I had never heard of Jaro-Winkler but that does sound like what I'm trying to do. But I was testing the `stringdist` method with different strings and `method=jw` and I'm not seeing any change when the beginning is different vs when the ending is different... so maybe I'm not correctly understanding how this should work? – ctrl-z pls May 12 '17 at 12:06
  • I found that a value of `p` of around 0.15-0.2 was necessary to give a decent weight to the beginning of words. Try some different values and see which gives the sort of result you are looking for. – Andrew Gustar May 12 '17 at 12:09
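For reference, here is what the per-operation costs mentioned in the first comment look like. They apply globally to each operation, not per position, which is why I still needed my own function (the strings here are just placeholders):

```r
library(stringdist)

# Base R's adist: one global cost each for insertions, deletions and
# substitutions (not position-dependent).
adist("acme corp", "acme corporation",
      costs = list(insertions = 2, deletions = 1, substitutions = 1))

# stringdist's equivalent: `weight` gives the deletion, insertion,
# substitution and transposition costs (the package requires these to be
# positive and at most 1; transposition is ignored for method = "lv").
stringdist("acme corp", "acme corporation",
           method = "lv", weight = c(d = 1, i = 0.5, s = 1, t = 1))
```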

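And here is the sort of check discussed in the last few comments: with a nonzero `p` (which must be between 0 and 0.25), Jaro-Winkler does distinguish a difference at the start of a string from one at the end (the value 0.15 is just one of the suggested settings):

```r
library(stringdist)

# With p = 0 (the default) a single substitution at the start and one at the
# end of these strings score the same; with p around 0.15-0.2 the shared
# prefix earns a bonus, so a difference at the beginning costs more.
stringdist("acme corp", "acmi corp", method = "jw")            # start differs
stringdist("acme corp", "acme corq", method = "jw")            # end differs
stringdist("acme corp", "acmi corp", method = "jw", p = 0.15)  # larger distance
stringdist("acme corp", "acme corq", method = "jw", p = 0.15)  # smaller distance
```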
0 Answers