I have two large datasets, one around half a million records and the other one around 70K. These datasets have address. I want to match if any of the address in the smaller data set are present in the large one. As you would imagine address can be written in different ways and in different cases so it is quite annoying to see that there is not a match when it should have matched and there is a match when it should not have matched. I did some research and figured out the package stringdist that can be used. However I am stuck and I feel I am not using to its fullest capabilities and some suggestions on this would help.
Below is a sample dummy data along with code that I have created to explain the situation
Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR")
df1 <- data.table(Address1)
Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
df2 <- data.table(Address2)
df1[, key_match := gsub("[^[:alnum:]]", "", Address1)]
df2[, key_match := gsub("[^[:alnum:]]", "", Address2)]
fn_match = function(str, strVec, n){
strVec[amatch(str, strVec, method = "dl", maxDist=n,useBytes = T)]
}
df1[!is.na(key_match)
, address_match :=
fn_match(key_match, df2$key_match,3)
]
If you see the output it gives me the matches under address_match in df1. If I apply the same code on my main data, the code is still running from last 30 hours. Though I have converted to data.table. Not sure how I can speed this up.
I was doing further reading and came across stringdist matrix. This seems to be more helpful and I can split the address basis the space and check for presence of each word in each address list and depending upon the maximum match one can create the summary of matches. However I am not very good at loops. How do I loop thru each address from the smaller file for each word and check in individual address in the larger file and create matrix of matches? Any help!!