
I have two large datasets, one with around half a million records and the other with around 70K. Both datasets contain addresses, and I want to check whether any of the addresses in the smaller dataset are present in the larger one. As you can imagine, addresses can be written in different ways and in different cases, so it is quite annoying to see a non-match where there should have been a match, and a match where there should not have been one. I did some research and found the stringdist package. However, I am stuck: I feel I am not using it to its fullest capabilities, and some suggestions on this would help.

Below is some dummy sample data, along with the code I have written, to illustrate the situation:

Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR")
df1 <- data.table(Address1)

Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
df2 <- data.table(Address2)

df1[, key_match := gsub("[^[:alnum:]]", "", Address1)]
df2[, key_match := gsub("[^[:alnum:]]", "", Address2)]

# for each string in str, return the closest entry of strVec within
# Damerau-Levenshtein distance n (NA if nothing is close enough)
fn_match <- function(str, strVec, n){
  strVec[amatch(str, strVec, method = "dl", maxDist = n, useBytes = TRUE)]
}

df1[!is.na(key_match), address_match := fn_match(key_match, df2$key_match, 3)]

If you look at the output, it gives me the matches under address_match in df1. But when I apply the same code to my main data, it has now been running for 30 hours, even though I have converted everything to data.table. I am not sure how to speed this up.

I did some further reading and came across stringdistmatrix. This seems more helpful: I could split each address on spaces, check for the presence of each word in each address of the other list, and summarise the matches by the maximum word-level match count. However, I am not very good with loops. How do I loop through each address in the smaller file, word by word, and check it against each individual address in the larger file to create a matrix of matches? Any help!!
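
To make the idea concrete, below is a rough, untested sketch of what I have in mind (the per-word distance threshold of 1 and the comma/space tokenization are arbitrary choices on my part):

library(stringdist)

# for each address in the smaller set, count how many of its words have
# a near-match (distance <= 1) among the words of each candidate, and
# keep the candidate with the most word hits
best_word_match <- function(addr, candidates) {
  words <- strsplit(tolower(addr), "[[:space:],]+")[[1]]
  hits <- sapply(candidates, function(cand) {
    cand_words <- strsplit(tolower(cand), "[[:space:],]+")[[1]]
    # distance matrix: rows = words of addr, cols = words of candidate
    d <- stringdistmatrix(words, cand_words, method = "dl")
    sum(apply(d, 1, min) <= 1)  # words of addr with a close counterpart
  })
  candidates[which.max(hits)]
}

sapply(Address1, best_word_match, candidates = Address2)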

user1412
  • Perhaps you could start by normalizing the addresses, splitting numbers from street names, and then use the stringdist library to do the fuzzy match (a rough sketch of this idea follows these comments). Here is an example: https://www.r-bloggers.com/fuzzy-string-matching-a-survival-skill-to-tackle-unstructured-information/ Tell me if it helped – Slim Feb 27 '17 at 13:03
  • @Slim, I had seen this link, but as a beginner in R I am finding it difficult to implement in my case of address matching. A worked solution would help! – user1412 Mar 01 '17 at 09:57
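
A minimal sketch of the normalization idea from the comment above, assuming lower-casing, punctuation stripping, and whitespace collapsing are enough cleanup (the maxDist of 5 is an arbitrary choice):

library(stringdist)

normalize_addr <- function(x) {
  x <- tolower(x)
  x <- gsub("[^[:alnum:] ]", " ", x)    # drop punctuation but keep word breaks
  gsub("[[:space:]]+", " ", trimws(x))  # collapse repeated whitespace
}

idx <- amatch(normalize_addr(Address1), normalize_addr(Address2),
              method = "dl", maxDist = 5)
data.frame(Address1, match = Address2[idx])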

1 Answer


I have a solution that does not require data.table; if the set is huge, it can run in parallel via the parallel package. Note that rbind.pages() comes from jsonlite, and the pipe and select() from dplyr:

library(jsonlite)  # rbind.pages
library(dplyr)

rbind.pages(
  parallel::mclapply(Address1, function(i){
    data.frame(
      src = i,
      match = Address2[which.min(adist(i, Address2))]
    )
  }, mc.cores = parallel::detectCores() - 2)) %>%
  select(`src (Address1)` = 1, `match (Address2)` = 2)

Which gives the following output:

                          src (Address1)                     match (Address2)
1                    786, GALI NO 5, XYZ                   786, GALI NO 4 XYZ
2       rambo, 45, strret 4, atlast, pqr del, 546, strret2, towards east, pqr
3 23/4, 23RD FLOOR, STREET 2, ABC-E, PQR                  23/4, STREET 2, PQR
4                    45-B, GALI NO5, XYZ                  45B, GALI NO 5, XYZ
5                 HECTIC, 99 STREET, PQR                  23/4, STREET 2, PQR

Edit:

I realized that this may not be very helpful without seeing the distance computations, so that you can tweak it for your needs; so I replicated the data into larger random sets, and amended the function to show the string-distance computations and the processing time:

rand_addy_one <- rep(Address1, 1000)[sample(1:1000, 1000)]
rand_addy_two <- rep(Address2, 3000)[sample(1:3000, 3000)]


system.time({
  test_one <<- rbind.pages(parallel::mclapply(rand_addy_one, function(i) {
    # counts = TRUE makes adist() attach the insertion/deletion/substitution
    # counts, which drop() reduces to a matrix with columns ins, del, sub
    calc <- as.data.frame(drop(attr(adist(i, rand_addy_two, counts = TRUE), "counts")))
    calc$totals <- rowSums(calc)
    calc %>% mutate(src = i, target = rand_addy_two) %>% 
      filter(totals == min(totals))
  }, mc.cores = parallel::detectCores() - 2))  %>% 
    select(`source Address1` = src, `target Address2(matched)` = target,
           insertions = ins, deletions = del, substitutions = sub,
           total_approx_dist = totals)
})

   user  system elapsed 
 24.940   1.480   3.384 

nrow(test_one)
[1] 600000

Now to reverse and apply the larger set to the smaller:

system.time({
   test_two <<- rbind.pages(parallel::mclapply(rand_addy_two, function(i) {
    calc <- as.data.frame(drop(attr(adist(i, rand_addy_one, counts = TRUE), "counts")))
    calc$totals <- (rowSums(calc))
    calc %>% mutate(src = i, target = rand_addy_one) %>% 
        filter(totals == min(totals))
}, mc.cores = parallel::detectCores() - 2))  %>% 
    select(`source Address2` = src, `target Address1(matched)` = target,
           insertions = ins, deletions = del, substitutions = sub,
           total_approx_dist = totals)
})

   user  system elapsed 
 27.512   1.280   4.077 

nrow(test_two)
[1] 720000
Carl Boneri
  • thank you for your solution. I tried replicating the code on a smaller data set and observed that for a single address it returns multiple matches along with the distance; with 2000 records to match, it returns around 5000 rows with different matches. I was wondering how we can get back only the match with the least distance, OR the two matches with the least distance. – user1412 Mar 02 '17 at 10:09
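
A possible tweak for the comment above (a sketch, not the answerer's code): replacing the filter(totals == min(totals)) step with dplyr's slice_min() (available in dplyr >= 1.0.0) keeps only the lowest-distance row per source address; with_ties = FALSE breaks ties arbitrarily, and n = 2 would keep the two closest matches instead:

library(jsonlite)  # rbind.pages
library(dplyr)

test_best <- rbind.pages(parallel::mclapply(rand_addy_one, function(i) {
  calc <- as.data.frame(drop(attr(adist(i, rand_addy_two, counts = TRUE), "counts")))
  calc$totals <- rowSums(calc)
  calc %>%
    mutate(src = i, target = rand_addy_two) %>%
    slice_min(totals, n = 1, with_ties = FALSE)  # n = 2 for the two closest
}, mc.cores = parallel::detectCores() - 2))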