I have written a script to do some fuzzy matching of company names. I'm matching a number of not-always-completely-correct company names (e.g. there might be small spelling mistakes, or the "inc." suffix might be missing) against a corpus of "correct" company names and IDs. The point is to attach the correct IDs to the not-always-correct company names.
Here is a grossly simplified version of the datasets I'm matching (I'm not using the zip column yet, but I will get back to it later):
df <- data.frame(zip = c("4760","5445", "2200"), company = c("company x", "company y", "company z"))
corpus <- data.frame(zip = c("4760","5445", "2200", "2200", "2200"), company = c("company x inc.", "company y inc.", "company z inc.", "company a inc.", "company b inc."), id = c(12121212, 23232323, 34343434, 56565656, 67676767))
df
   zip   company
1 4760 company x
2 5445 company y
3 2200 company z
corpus
   zip        company       id
1 4760 company x inc. 12121212
2 5445 company y inc. 23232323
3 2200 company z inc. 34343434
4 2200 company a inc. 56565656
5 2200 company b inc. 67676767
I then use the following piece of code to create a matrix of string distances:
library(stringdist)
distance.method <- c("jw")
string.dist.matrix <- stringdistmatrix(tolower(corpus$company),
                                       tolower(df$company),
                                       method = distance.method,
                                       nthread = getOption("sd_num_thread"))
string.dist.matrix
          [,1]      [,2]      [,3]
[1,] 0.1190476 0.1798942 0.1798942
[2,] 0.1798942 0.1190476 0.1798942
[3,] 0.1798942 0.1798942 0.1190476
[4,] 0.1798942 0.1798942 0.1798942
[5,] 0.1798942 0.1798942 0.1798942
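On a matrix like this one, the matching step amounts to picking, for each column (a name in df), the row (a corpus entry) with the smallest distance. A minimal sketch of that step, where `d`, `best` and `matched` are names made up for illustration and not part of my actual script:

```r
library(stringdist)

df <- data.frame(zip = c("4760", "5445", "2200"),
                 company = c("company x", "company y", "company z"))
corpus <- data.frame(zip = c("4760", "5445", "2200", "2200", "2200"),
                     company = c("company x inc.", "company y inc.", "company z inc.",
                                 "company a inc.", "company b inc."),
                     id = c(12121212, 23232323, 34343434, 56565656, 67676767))

# Rows are corpus entries, columns are the names to be matched
d <- stringdistmatrix(tolower(corpus$company), tolower(df$company), method = "jw")

# For each column, the row index of the closest corpus entry
best <- apply(d, 2, which.min)

# Attach the matched corpus name, its id, and the distance that won
matched <- data.frame(company = df$company,
                      matched_company = corpus$company[best],
                      id = corpus$id[best],
                      distance = d[cbind(best, seq_along(best))])
matched
```

Note that `which.min` returns the first minimum, so ties are resolved by row order; a distance threshold for rejecting poor matches is left out here.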
I then match up the pairs with the minimum distance. Normally I want to match roughly 4,000 companies against a corpus of 4.5 million companies, which takes considerable computing power. My idea is that, instead of calculating the string distance between all possible pairs, I would only calculate it for pairs that share a zip code. As I see it, this would mean far fewer calculations and also more precise fuzzy matching in cases more complex than the ones illustrated here with my simplified data.
In short the resulting matrix I would want would be something like this:
          [,1]      [,2]      [,3]
[1,] 0.1190476        NA        NA
[2,]        NA 0.1190476        NA
[3,]        NA        NA 0.1190476
[4,]        NA        NA 0.1798942
[5,]        NA        NA 0.1798942
I just can't seem to figure out a way to do it. Any ideas?
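To make the goal concrete, here is a rough sketch of one possible approach (blocking on zip code), not a tested solution: start from an all-NA matrix and fill in distances one zip code at a time. The names `blocked` and `matches` are hypothetical, and the dense matrix in the first part is only practical on toy data like the above; with 4.5 million corpus rows it would not fit in memory, which is why the second part keeps only the best match per name.

```r
library(stringdist)

df <- data.frame(zip = c("4760", "5445", "2200"),
                 company = c("company x", "company y", "company z"))
corpus <- data.frame(zip = c("4760", "5445", "2200", "2200", "2200"),
                     company = c("company x inc.", "company y inc.", "company z inc.",
                                 "company a inc.", "company b inc."),
                     id = c(12121212, 23232323, 34343434, 56565656, 67676767))

# Part 1: dense matrix with NA for pairs that do not share a zip code
blocked <- matrix(NA_real_, nrow = nrow(corpus), ncol = nrow(df))
for (z in intersect(as.character(df$zip), as.character(corpus$zip))) {
  rows <- which(corpus$zip == z)  # corpus entries in this zip code
  cols <- which(df$zip == z)      # names to match in this zip code
  blocked[rows, cols] <- stringdistmatrix(tolower(corpus$company[rows]),
                                          tolower(df$company[cols]),
                                          method = "jw")
}
blocked

# Part 2: skip the big matrix and keep only the best match within each zip block
blocks <- split(seq_len(nrow(df)), as.character(df$zip))
matches <- do.call(rbind, lapply(names(blocks), function(z) {
  cols <- blocks[[z]]
  rows <- which(corpus$zip == z)
  if (length(rows) == 0) return(NULL)  # no corpus entry shares this zip code
  d <- stringdistmatrix(tolower(corpus$company[rows]),
                        tolower(df$company[cols]),
                        method = "jw")
  best <- apply(d, 2, which.min)       # closest corpus row per name in the block
  data.frame(company = df$company[cols],
             id = corpus$id[rows[best]],
             distance = d[cbind(best, seq_along(best))])
}))
matches
```

On the toy data, `blocked` should have the same NA pattern as the matrix I described above. One caveat of this design is that a name with a wrong or missing zip code can never be matched, so a fallback to the full comparison for unmatched names may still be needed.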