I am trying to compare two lists of company names using the Levenshtein fuzzy string matching algorithm. I already implemented my own version in Python, but despite numpy and other tricks to speed it up, Python is slow: on my data set a run takes anywhere from 1 to 3 hours depending on which variation I try. So I turned to R, hoping for better performance. Unfortunately, I have never used R before.
I found the stringdist package for R, but I need to control the cost of each edit operation to make my results more accurate, and I couldn't figure out a way to do that with a premade function. So I tried to write my own. I now have a working weighted Levenshtein function in R, but it's basically a line-for-line translation of my Python, so it's just as slow, if not slower.
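To make the cost-control requirement concrete, here is a stripped-down Python sketch of the kind of weighted Levenshtein I mean (the parameter names and default costs are just illustrative, not my actual code):

```python
# Hypothetical weighted Levenshtein: the insert/delete/substitute costs
# are parameters instead of the usual all-ones weights.
def weighted_levenshtein(a, b, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    m, n = len(a), len(b)
    # dp[j] holds the cost of turning a[:i] into b[:j] for the current row i;
    # row 0 builds b[:j] from the empty string by insertions alone.
    dp = [j * ins_cost for j in range(n + 1)]
    for i in range(1, m + 1):
        prev_diag = dp[0]            # dp[i-1][j-1]
        dp[0] = i * del_cost         # dp[i][0]: delete all of a[:i]
        for j in range(1, n + 1):
            prev_row = dp[j]         # dp[i-1][j]
            cost = 0.0 if a[i - 1] == b[j - 1] else sub_cost
            dp[j] = min(dp[j] + del_cost,        # delete a[i-1]
                        dp[j - 1] + ins_cost,    # insert b[j-1]
                        prev_diag + cost)        # substitute / match
            prev_diag = prev_row
    return dp[n]
```

It's the three `min()` candidates whose costs I need to be able to tune independently.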
How can I vectorize the Levenshtein algorithm so that it runs quickly in R? I've tried using sapply to replace my loops over the lists, but if anything that runs even slower. I suspect the actual guts of the Levenshtein function need to be vectorized, but I have no idea how to go about that.
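For reference, this is roughly how I imagine "vectorizing the guts" would look if I stayed in numpy: process the DP table one row at a time, and fold the within-row insertion dependency into a prefix minimum (a sketch assuming a single constant insertion cost; names are illustrative):

```python
import numpy as np

def weighted_levenshtein_np(a, b, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    # Row-vectorized weighted Levenshtein distance.
    b_arr = np.array(list(b))
    n = len(b)
    dp = np.arange(n + 1) * ins_cost              # row 0: build b by insertions
    for i, ch in enumerate(a, start=1):
        sub = np.where(b_arr == ch, 0.0, sub_cost)
        # candidates that only depend on the previous row:
        cand = np.minimum(dp[1:] + del_cost,      # delete a[i-1]    (from above)
                          dp[:-1] + sub)          # substitute/match (diagonal)
        row = np.empty(n + 1)
        row[0] = i * del_cost
        row[1:] = cand
        # the within-row dependency row[j] = min(row[j], row[j-1] + ins_cost)
        # becomes a prefix minimum after subtracting a linear insertion ramp:
        ramp = np.arange(n + 1) * ins_cost
        dp = np.minimum.accumulate(row - ramp) + ramp
    return dp[n]
```

I don't know what the idiomatic R equivalent of this row sweep would be, which is really what I'm asking.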
Is there a resource I've missed, or a way for me to implement this myself? I'm at a loss, and R has been the opposite of intuitive for me so far.