
I'm trying to fuzzy match rows across different dataframes/datatables, based on a company name variable. I've matched a big chunk of these through standard joins and some regex to remove words (such as Limited, Ltd, etc.), but I'm left with a number that I'd like to match by fuzzy joining (e.g. where one dataset has a spelling mistake that I can't catch otherwise).
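For context, the cleanup I've already done is along these lines (the word list here is shortened for illustration; my real list is longer):

# Illustrative cleanup only: strip common legal suffixes and tidy
# whitespace/case before attempting exact joins
clean_name <- function(x) {
  x <- tolower(x)
  x <- gsub("\\b(limited|ltd|inc|plc)\\b\\.?", "", x, perl = TRUE)
  trimws(gsub("\\s+", " ", x))
}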

The datasets I'm trying to join are very big, so I need to split the smaller one into more manageable chunks (e.g. 200 rows), then ideally fuzzy match each chunk (keeping only the closest matches), then combine all the results. For example, if I have the following two datasets (much reduced in size):

company_names1 <- c("ABC Company", "Jelly Jelly (fka Jelly Jelly)", "Forest Rovers", "Table Manufacturers")
company_revenue <- c(200, 100, 84, 940)
company_profit <- c(180, 9, -40, 450)
df1 <- data.frame(company_names1, company_revenue, company_profit)

company_names2 <- c("ABCs Company", "Cars etc", "Jelly (Jelly) Company", "Forest Green Rovers", "Palm Trees for all inc", "Terrific Turtles", "Big Table Manufacturers")
city <- c("Berlin", "London", "Paris", "London", "San Diego", "Glasgow", "Denver")
df2 <- data.frame(company_names2, city)

In this instance, I'd like to split df1 into smaller chunks of rows, then try to fuzzy match company_names1 against company_names2. I've copied the code I know below, but any better ways of doing this would be welcome!

library(dplyr)
library(stringdist)

# Score all pairwise combinations with Jaro-Winkler distance, then keep
# only the single closest match per name1, below a 0.2 threshold
# (and above 0, since the exact matches are already dealt with)
join <- expand.grid(name1 = df1$company_names1, name2 = df2$company_names2, stringsAsFactors = FALSE)
join <- join %>% group_by(name1) %>%
        mutate(match_score = stringdist(name1, name2, method = 'jw')) %>%
        arrange(name1, match_score) %>%
        slice(1) %>%
        ungroup() %>%
        filter(match_score < 0.2 & match_score > 0)

(I've tried to use the fuzzyjoin package but it seems to be slower for me...)
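For reference, the fuzzyjoin attempt was along these lines (using stringdist_inner_join; note it keeps every pair under max_dist, not just the closest, so a slice step is still needed afterwards):

library(fuzzyjoin)

# Fuzzy join on the name columns, Jaro-Winkler distance, same 0.2 threshold
matches <- stringdist_inner_join(
  df1, df2,
  by = c("company_names1" = "company_names2"),
  method = "jw",
  max_dist = 0.2,
  distance_col = "match_score"
)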

Is there a way to use a function to split df1 into smaller dataframes based on the number of rows, run the above on each chunk, and then combine/rbind all of the resulting 'join' dataframes into one dataframe in one go?
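To make that concrete, something like this sketch is what I'm imagining (untested on the full data; the helper name match_chunk and the chunk size of 200 are just placeholders):

library(dplyr)
library(stringdist)

# Fuzzy match one chunk of names against all of names2, keeping only
# each name's single closest match under the threshold
match_chunk <- function(names1, names2, max_score = 0.2) {
  expand.grid(name1 = names1, name2 = names2, stringsAsFactors = FALSE) %>%
    mutate(match_score = stringdist(name1, name2, method = "jw")) %>%
    group_by(name1) %>%
    slice_min(match_score, n = 1, with_ties = FALSE) %>%
    ungroup() %>%
    filter(match_score < max_score & match_score > 0)
}

# Split the names into chunks of up to 200 rows, match each chunk,
# then rbind all the per-chunk results into one dataframe
chunk_size <- 200
chunk_id <- ceiling(seq_len(nrow(df1)) / chunk_size)
result <- split(df1$company_names1, chunk_id) %>%
  lapply(match_chunk, names2 = df2$company_names2) %>%
  bind_rows()

The idea is that each call only ever materialises chunk_size * nrow(df2) pairs at a time, which should avoid the allocation errors.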

Please let me know if I can provide any other detail on this.

  • What are your N's here: how many names (N1) do you need to fuzzy match, and how many companies (N2) are you matching to? Regardless of how you structure it, that's N1*N2 combinations, which is why it's slow. Splitting will make it work; I'm not sure it will increase your speed at all, though. – Reeza Dec 06 '21 at 17:33
  • I'm trying to match UK companies first, which is roughly 200,000 x 5,000 names (although doing it for other countries might mean more on both sides). The fuzzy matching methods I know aren't working: they error out on allocating an object that big, or run out of space. So my hope was that splitting the 5,000 names would reduce the size of any matrix that has to be made, letting me filter it and delete the non-matches first, avoiding the errors I've received. – thetimidshrew Dec 07 '21 at 10:06
  • Yes, splitting will get rid of the out-of-space issues. I'm assuming this is the remainder and you've already removed all the exact matches? If not, that should be your first step. – Reeza Dec 07 '21 at 15:42
  • That's right, this is the remainder, as I've removed the exact matches and the matches found after removing common words (Limited etc.). – thetimidshrew Dec 07 '21 at 17:23
