
TL;DR: I'd like to match two columns of unequal length whose values are business names. I've tried stringdist's amatch with Jaro-Winkler matching, which gets close but not nearly close enough. I'm wondering whether stringi would be useful here; I just don't quite understand how to use it, as I'm new to this. I wouldn't ask otherwise, but I don't think I'll be able to figure it out myself in time.

For context, there are 2079 business names in one column and 1878 business names in a second column. Many of these contain the business structure as a suffix (e.g. LLC, Inc., INC., Co.), so I trimmed those out in Excel before going into R. The names were entered manually in both columns, so there are human-entry variations and errors.
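For reference, the same suffix trimming could also be done inside R, so that both columns go through identical clean-up. This is only a minimal sketch in base R: `clean_name` is a hypothetical helper, and the suffix list in the pattern is an assumption that would need extending for the real data.

```r
# Hypothetical helper: normalise a business name before fuzzy matching.
clean_name <- function(x) {
  # Trim surrounding whitespace and ignore case differences
  x <- toupper(trimws(x))
  # Drop one trailing legal suffix, with optional leading comma/period and
  # optional trailing period (assumed suffix list; extend as needed)
  x <- gsub("[,.]?\\s+(LLC|INC|CO|CORP|LTD)\\.?$", "", x, perl = TRUE)
  # Collapse runs of whitespace into a single space
  gsub("\\s+", " ", x, perl = TRUE)
}

cleaned <- clean_name(c("Acme Widgets, LLC", "acme  widgets Inc.", "Smith Supply Co."))
print(cleaned)
```

Applying one function to both columns avoids the case where a suffix is removed from one column but survives in the other.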

I used this call:

amatch(match$sales, match$box, maxDist = 0.25, method = "jw", weight = c(d = 1, i = 0.9, s = 0.9, t = 0.9), p = 0.2, matchNA = FALSE, bt = 0.25)

I was able to get some results with this, but many matches were duplicated because different companies share the first word, or the first combination of words/letters (e.g. "A & A" vs "A & B"). I understand this follows from how Jaro-Winkler works, since it rewards a shared prefix, but I don't know how to adjust it enough.
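To make the prefix problem concrete, here is a small sketch that computes the full distance matrix with `stringdistmatrix` instead of calling `amatch`, and reports how far the best candidate is ahead of the runner-up. The sample names are made up for illustration, and `p = 0.1` is an assumed prefix weight rather than a tuned value.

```r
library(stringdist)

# Made-up sample names (not the real data)
sales <- c("A & A PLUMBING", "A & B PLUMBING", "NORTHSIDE BAKERY")
box   <- c("A & A PLUMBNG", "A & B PLUMBING", "NORTH SIDE BAKERY")

# Rows correspond to `sales`, columns to `box`
d <- stringdistmatrix(sales, box, method = "jw", p = 0.1)

# Column index of the closest `box` name for each `sales` name
best <- apply(d, 1, which.min)

res <- data.frame(
  sales  = sales,
  match  = box[best],
  dist   = d[cbind(seq_along(sales), best)],
  # Gap between the best and second-best candidate; a small gap means
  # the match is ambiguous (e.g. "A & A" vs "A & B")
  margin = apply(d, 1, function(r) diff(sort(r)[1:2]))
)
print(res)
```

Rows with a small `margin` are the ones worth reviewing by hand, since two candidates are nearly tied.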

I need to match values in Column b to Column a. There may be duplicates in Column a. I don't have any specific rules for similarity; I want the closest possible match for each value, with a minimal number of false duplicates.
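One way to read "minimal false duplicates" is that each Column a value should be used at most once. Under that assumption (which would not hold if Column a contains genuine duplicates that should each be matched), a greedy one-to-one assignment could be sketched as follows. `greedy_match` is a hypothetical helper, and `max_dist` mirrors the `maxDist` cutoff above.

```r
library(stringdist)

# Hypothetical helper: for each element of `b`, return the index of its
# matched element in `a`, using every element of `a` at most once.
greedy_match <- function(b, a, method = "jw", max_dist = 0.25, ...) {
  d <- stringdistmatrix(b, a, method = method, ...)  # rows = b, columns = a
  out <- rep(NA_integer_, length(b))
  # All candidate pairs within the cutoff, closest pairs first
  pairs <- which(d <= max_dist, arr.ind = TRUE)
  pairs <- pairs[order(d[pairs]), , drop = FALSE]
  used_a <- logical(length(a))
  for (k in seq_len(nrow(pairs))) {
    i <- pairs[k, 1]
    j <- pairs[k, 2]
    # Accept the pair only if neither side has been matched yet
    if (is.na(out[i]) && !used_a[j]) {
      out[i] <- j
      used_a[j] <- TRUE
    }
  }
  out
}

# Made-up sample names (not the real data)
a <- c("A & A PLUMBING", "A & B PLUMBING")
b <- c("A & A PLUMBNG", "A & B PLUMBING")
idx <- greedy_match(b, a, p = 0.1)
print(data.frame(b = b, matched_a = a[idx]))
```

Values of `b` with no unused candidate within `max_dist` come back as `NA`. A greedy pass is simple but not guaranteed to give the globally best assignment.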

For starters, would there be an easier way to accomplish this within stringi?

Please advise, as I am unsure how best to tackle this problem moving forward. If further details are required, I'm happy to provide them. Thank you in advance.

  • Please provide reproducible sample data. And please be clear what rules you want for 'similarity'. For instance, you may want to distinguish between (i.e. separate) `A&A` and `A&B` but not distinguish between (i.e. group) `AA` and `AB` – CPak Aug 28 '17 at 13:17
  • Okay, I've done so above. – Amjad Talib Aug 28 '17 at 13:57
  • The current dataset is not balanced, could you include mix of true positive cases and false positive cases to test strength of various methods and use `dput` function to include data structure e.g. `dput(head(my_input_data,10))` – Silence Dogood Aug 28 '17 at 14:03
  • Is this balanced now? – Amjad Talib Aug 31 '17 at 15:50
