
I have a data frame with 5 million different company names, many of which refer to the same company spelled in different ways or with misspellings. I use the company name "Amminex" as an example here and then try to stringdist it against the 5 million company names:

Companylist <- data.frame(Companies=c('AMMINEX'))

This is my big list of company names, which has already been read in; I wrap it in a data frame:

Biglist <- data.frame(name=c(Biglist[,]))
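The read-in step itself is not shown; before the line above can run, Biglist must already exist. A hedged sketch of what that load might look like, with a placeholder file name, and with the column name Companiesnames inferred from the expand.grid call further down:

# Hypothetical load step: the file name is a placeholder, and the file is
# assumed to contain a single column of raw names called Companiesnames
Biglist <- read.csv("company_names.csv", stringsAsFactors = FALSE)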

I put AMMINEX and the 5 million company names together in one data frame of all pairs:

Matches <- expand.grid(Companylist$Companies,Biglist$name.Companiesnames)
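As a side note, with a single query name this cross join is manageable, but with many query names it grows quickly. A minimal sketch of an alternative using stringdist::stringdistmatrix, which computes the same pairwise distances without materializing the grid (same objects as above assumed):

library(stringdist)

# one row per query company, one column per name in the big list
dist_mat <- stringdistmatrix(as.character(Companylist$Companies),
                             as.character(Biglist$name.Companiesnames),
                             method = "cosine")
dim(dist_mat)  # 1 x 5,000,000 for the single AMMINEX query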

Change the column names:

names(Matches) <- c("Companies","CompaniesList")

I use stringdist with the cosine method:

library(stringdist)

Matches$dist <- stringdist(Matches$Companies, Matches$CompaniesList, method="cosine")
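The cosine method in stringdist works on q-grams, and the q-gram size can be tuned; one of the comments below suggests that n-gram based metrics with n = 2 often work well. A small sketch of that variation on the line above:

# cosine distance computed on character bigrams instead of single characters
Matches$dist_q2 <- stringdist(Matches$Companies, Matches$CompaniesList,
                              method = "cosine", q = 2)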

I remove all distances that are above 0.2 to get rid of bad matches:

Matches_trimmed <- Matches[!(Matches$dist>0.2),]

I sort by the distance column so best matches appear on the top:

Matches_trimmed <- Matches_trimmed[with(Matches_trimmed, order(dist)), ]

As you can see here, the results are not very satisfactory:

[screenshot of the sorted matches]

The first row is a good match, but then a bunch of bad matches appear before the good "AMMINEX AS" matches finally show up at the bottom.

This doesn't really work out for me. Is there any way I can improve this fuzzy matching, or maybe use a different method for better results? Maybe a method that looks at the order in which the letters appear in the strings?
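One direction raised in the comments below is the Jaro-Winkler metric, which is sensitive to letter order and, with p = 0.1, rewards a shared prefix. A minimal sketch of that variant of the pipeline above; the 0.1 cut-off is only a starting point, not a recommendation:

# Jaro-Winkler distance; p = 0.1 lowers the distance for names sharing a prefix
Matches$dist_jw <- stringdist(Matches$Companies, Matches$CompaniesList,
                              method = "jw", p = 0.1)

# trim and sort, as in the cosine version above
jw_trimmed <- Matches[Matches$dist_jw <= 0.1, ]
jw_trimmed <- jw_trimmed[order(jw_trimmed$dist_jw), ]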

  • With names it is better to use the Jaro or Jaro-Winkler metric. – phiver Mar 31 '18 at 09:52
  • @phiver Thank you for the suggestion. I tried "jw" in the methods and it indeed did work better. https://i.imgur.com/WoU4RvK.png However, it still seems to match names like "LAMINEX" and "ARMINE" with relatively low distance values. I fear that once I add more companies to get matched, it will be difficult for me to determine the cut-off distance value to ensure that the useful matches are kept and as many bad matches as possible are tossed out. – WoeIs Mar 31 '18 at 10:15
  • Do you have other information on the companies you could use to help determine the probability of a match? Maybe worth looking into the RecordLinkage package, which would allow you to try to match with many covariates, and simplifies finding a threshold for a match with EM algorithm. – gfgm Mar 31 '18 at 10:22
  • @gfgm I've taken a look at the RecordLinkage package. Unfortunately it's not possible for me to find any information besides their names at this point. Is there anything in the RecordLinkage package that can perhaps match based on the order of the letters in the names? I'm assuming the jaro-winkler metric already does that? – WoeIs Mar 31 '18 at 10:48
  • @WoeIs, you could use "jw" with p = 0.1. That gives a penalty to the matching. But deciding on the best cut-off will be up to you. I suggest running it over a few companies and seeing if below 0.1 is a good target. – phiver Mar 31 '18 at 11:02
  • You can always add columns that you feel represent evidence of a match, e.g. if you think a lot of the mistakes are of the form `"true_name garbage"` you can create an additional covariate which is the first n letters of the company name and try for exact matches on the stemmed name in addition to fuzzy matches on the whole -- this would add weight to patterns like `"AMMINEX" -> "AMMINEX AS"` – gfgm Mar 31 '18 at 11:04 (a sketch of this idea appears after these comments)
  • @phiver I might have to end up going that route, thanks! – WoeIs Mar 31 '18 at 11:32
  • @gfgm That sounds like an interesting procedure! Which command in the package would you suggest that I use for this? I've been going through the list of commands in the package but I'm having a bit of a difficult time telling which of them adds a co-variate. – WoeIs Mar 31 '18 at 11:33
  • @WoeIs its not in the package, you would need to do it manually in advance for the two dataframes you want to match, e.g. `df1$stem_n <- substr(df1$name, 1, n)` etc. – gfgm Mar 31 '18 at 11:42
  • @gfgm Thanks for the suggestion. Unfortunately, I'm not quite sure how to incorporate substr into my previous code. I'm still on a very beginner level so all I was able to do was to reduce the name of "AMMINEX" based on the properties I specified in substr. What seems to be the main problem for me though is that companies like "LAMINEX" still appear, and I can't seem to find any algorithm that can see the difference between AMMINEX and LAMINEX. – WoeIs Mar 31 '18 at 12:23
  • Try qgrams and Jaccard metrics in stringdist. These both use n-grams so the order of letters does matter. You can ‘tune’ the value of n to find the best for your problem, but I have found that 2 works well. – Relasta Mar 31 '18 at 12:36
  • @Relasta Unfortunately those metrics didn't quite help either. – WoeIs Mar 31 '18 at 14:14
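Building on gfgm's stemming idea above, a minimal sketch of combining an exact match on the first n letters with the fuzzy distance on the full name; the stem length n = 7 and the 0.5 down-weighting are arbitrary illustration values, not tested recommendations:

n <- 7  # arbitrary stem length chosen for illustration

# stem both sides: the first n characters of each name
Matches$stem_query <- substr(as.character(Matches$Companies), 1, n)
Matches$stem_cand  <- substr(as.character(Matches$CompaniesList), 1, n)

# fuzzy distance on the full names (Jaro-Winkler, as suggested in the comments)
Matches$dist_jw <- stringdist(Matches$Companies, Matches$CompaniesList,
                              method = "jw", p = 0.1)

# reward exact stem matches by halving their distance (illustrative weighting)
Matches$score <- ifelse(Matches$stem_query == Matches$stem_cand,
                        0.5 * Matches$dist_jw,
                        Matches$dist_jw)

head(Matches[order(Matches$score), ])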

0 Answers