I'm trying to fuzzy match rows across different dataframes/datatables, based on the name of the company variable. I've matched a big chunk of these through standard joins and the use of some regex to remove words (such as Limited, Ltd) etc., but I'm left with a number that I'd like to match by fuzzy joining (e.g. if one dataset has a spelling mistake that I can't catch otherwise).
The datasets I'm trying to join are very big, so I need to split the smallest into more manageable chunks (e.g. 200 rows), then ideally I'd fuzzy match them (keeping only the closest matches), then combine all the answers. For example, if I have the following two datasets (much reduced in size):
company_names1 <- c("ABC Company", "Jelly Jelly (fka Jelly Jelly)", "Forest Rovers", "Table Manufacturers")
company_revenue <- c(200, 100, 84, 940)
company_profit <- (180, 9, -40, 450)
df1 <- c(company_names1, company_revenue, company_profit)
company_names2 <- c("ABCs Company", "Cars etc", "Jelly (Jelly) Company", "Forest Green Rovers", "Palm Trees for all inc", "Terrific Turtles", "Big Table Manufacturers")
city <- c("Berlin", "London", "Paris", "London", "San Diego", "Glasgow", "Denver")
df2 <- c(company_names2, city)
In this instance, I'd like split df1 into a smaller number of rows, then try and fuzzy match company_names1 and company_names2. I've copied the code I know below, but any other better ways of doing this would be welcome!
join <- expand.grid(name1 = df1$company_names1, name2 = df2$company_names2, stringsAsFactors = FALSE)
join <- join %>% group_by(name1) %>%
mutate(match_score = stringdist(name1, name2, method='jw')) %>%
arrange(name1, match_score) %>%
slice(1) %>%
ungroup %>%
filter(match_score < 0.2 & match_score > 0)
(I've tried to use the fuzzyjoin package but it seems to be slower for me...)
Is there a way of using a function to split df1 into smaller dataframes based on the number of rows, do the above, and then combine/rbind the result of all of these 'join' dataframes into one dataframe in one go?
Please let me know if I can provide any other detail on this.