R string-based matching of business names

Question

TL;DR: I'd like to match two columns of unequal length whose values are business names. I've tried `amatch` from the stringdist package with Jaro-Winkler distance, which gets close but not nearly close enough. I am wondering whether stringi would be useful here; I just don't quite understand how to use it, so please excuse my being a noob. I wouldn't ask otherwise, but I don't think I'll be able to figure it out myself in time.

For context, there are 2079 business names in one column and 1878 business names in a second column. Many of these contained the business structure as a suffix (e.g. LLC, Inc., INC., Co.), so I trimmed those out in Excel before going into R. The names were entered manually into both columns, so there are human-entry variations and errors.
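In case it matters, the suffix trimming could presumably be done in R rather than Excel. This is only a rough sketch of what I mean: `strip_suffix` is a hypothetical helper, the suffix list is an assumption and not exhaustive, and the names are made up.

```r
# Hypothetical helper: upper-case a name and drop one trailing legal suffix.
strip_suffix <- function(x) {
  x <- toupper(trimws(x))
  # Trailing suffix, optionally preceded by a comma or period and followed by a period
  x <- gsub("[,.]?[[:space:]]+(LLC|INC|CO|CORP|LTD)\\.?$", "", x)
  trimws(x)
}

# Made-up examples
cleaned <- strip_suffix(c("Acme Widgets, LLC", "Acme Widgets Inc.", "Bolt Supply Co."))
cleaned
```

Upper-casing first means "Inc." and "INC." are treated the same way before any distance is computed.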

I used this call:

library(stringdist)
amatch(match$sales, match$box,
       maxDist = 0.25, method = "jw",
       weight = c(d = 1, i = 0.9, s = 0.9, t = 0.9),
       p = 0.2, matchNA = FALSE, bt = 0.25)

I was able to get some results with this, but many matches were duplicated because different companies share the first word, or the first combination of words/letters, e.g. "A & A" vs "A & B". I understand this follows from how the Jaro-Winkler formula rewards a shared prefix, but I don't quite know how to adjust it enough to avoid the problem.
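To illustrate the prefix issue, here is a minimal sketch using `stringdist()` from the same package, with made-up names. Setting `p = 0` gives the plain Jaro distance, while `p = 0.2` adds the Winkler prefix bonus; I left `bt` at its default here, which is an assumption that differs from my call above.

```r
library(stringdist)

# Made-up names that differ only in the last character
x <- "A & A"
y <- "A & B"

d_jaro <- stringdist(x, y, method = "jw", p = 0)    # plain Jaro distance
d_jw   <- stringdist(x, y, method = "jw", p = 0.2)  # Jaro-Winkler with prefix bonus

c(jaro = d_jaro, jw = d_jw)
```

The shared prefix pulls the Jaro-Winkler distance below the plain Jaro distance, so two different companies end up looking closer than they should under a fixed `maxDist`.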

I need to match values in Column b to Column a. There may be duplicates in Column a. I don't have any specific rules for similarity; I want the closest possible match for each value, and a minimal number of false duplicates.
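To make "closest match for each value" concrete, this is roughly the shape of output I am after. The sketch below is an assumption-laden stand-in: it uses base R's `adist()` (Levenshtein distance) instead of Jaro-Winkler, and the vectors `a` and `b` are made-up data standing in for my two columns.

```r
# Made-up sample data standing in for Column a and Column b
a <- c("A & A PLUMBING", "A & B ELECTRIC", "NORTHSIDE BAKERY")
b <- c("A & B ELECTRC", "NORTHSIDE BAKRY", "A & A PLUMBNG")

# Levenshtein distance matrix: rows correspond to b, columns to a
d <- adist(b, a)

# For each value of b, the index of the nearest value of a and its distance
best <- apply(d, 1, which.min)
result <- data.frame(
  b     = b,
  match = a[best],
  dist  = d[cbind(seq_along(b), best)],
  stringsAsFactors = FALSE
)

# TRUE where a value of a has already been picked by an earlier row
result$reused <- duplicated(result$match)
result
```

The `reused` column is there to flag the false duplicates I am trying to minimise, i.e. cases where several values in Column b land on the same value in Column a.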

For starters, would there be an easier way to accomplish this within stringi?

Please advise, as I am unaware how to best tackle this problem moving forward. If further details are required, I'm happy to oblige. Thank you in advance.

Please provide reproducible sample data. And please be clear what rules you want for 'similarity'. For instance, you may want to distinguish between (i.e. separate) `A&A` and `A&B` but not distinguish between (i.e. group) `AA` and `AB` — CPak, Aug 28 '17 at 13:17
The current dataset is not balanced, could you include mix of true positive cases and false positive cases to test strength of various methods and use `dput` function to include data structure e.g. `dput(head(my_input_data,10))` — Silence Dogood, Aug 28 '17 at 14:03
