3

I'm trying to inner join two datasets: df1 of 50,000 obs looks something like this:

  Name              | Line.1           | Line.2     | Town       | County       | Postcode 
 -------------------|------------------|------------|------------|--------------|---------- 
  ACME Inc          | 63 Long Street   |            | Fakeington | Lincolnshire | PA4 8QU  
  BETA LTD          | 91a              | Main Drove | Cloud City | Something    | BN1 6LD  
  The Giga          | 344 Lorem Street |            | Ipsom      | Dolor        | G2 8LY   

df2 of 500,000 obs looks like this:

  Name              | AddressLine1   | AddressLine2     | AddressLine3 | AddressLine4 | Postcode | RatingValue 
 -------------------|----------------|------------------|--------------|--------------|----------|------------- 
  ACME              |                | 63 Long Street   | Fakeington   | Lincolnshire | PA4 8QU  | 1           
  Random Company    |                | Rose Ave         | Fakeington   |              | AB2 51GL | 5           
  BETA Limited      | Business House | 91a Main Drove   | Something    |              | BN1 6LD  | 3           
  Giga Incorporated |                | 344 Lorem Street | Ipsum        | Dolor        | G2 8LY   | 5           

And I want to get to something like df_final.

  Name              | Postcode | RatingValue 
 -------------------|----------|------------- 
  ACME Inc          | PA4 8QU  | 1           
  BETA LTD          | BN1 6LD  | 3           
  Giga Incorporated | G2 8LY   | 5           

These are one-to-one matches and all the values in df1 should exist in df2. Postcode is an exact match while the address is split into multiple lines with no regular pattern, so I think my best bet is to match by Name.

I tried the fuzzyjoin package but I'm getting an Error: cannot allocate vector of size 120.6 Gb so I guess I have to use another method that works with bigger datasets.

Any ideas about what the best way to approach this is?

df1 <- data.frame(
  stringsAsFactors = FALSE,
              Name = c("ACME Inc", "BETA LTD", "Giga Incorporated"),
            Line.1 = c("63 Long Street", "91a", "344 Lorem Street"),
            Line.2 = c(NA, "Main Drove", NA),
              Town = c("Fakeington", "Cloud City", "Ipsom"),
            County = c("Lincolnshire", "Something", "Dolor"),
          Postcode = c("PA4 8QU", "BN1 6LD", "G2 8LY")
)

df2 <- data.frame(
  stringsAsFactors = FALSE,
              Name = c("ACME", "Random Company","BETA Limited","Giga Incorporated"),
      AddressLine1 = c(NA, NA, "Business House", NA),
      AddressLine2 = c("63 Long Street", "Rose Ave","91a Main Drove","344 Lorem Street"),
      AddressLine3 = c("Fakeington", "Fakeington", "Something", "Ipsum"),
      AddressLine4 = c("Lincolnshire", NA, NA, "Dolor"),
          Postcode = c("PA4 8QU", "AB2 51GL", "BN1 6LD", "G2 8LY"),
       RatingValue = c(1L, 5L, 3L, 5L)
)
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
  • 1
    I would first break this down into a smaller problem. Make a dataframe from two existing columns: df1$Name and df2$Name, where the rows represent every combination of "matching" company names based on meeting some matching threshold value (soundex, string distance, whatever you want). Do a Cartesian join of df1 and df2, then filter such that you only keep only valid combinations. – Bill O'Brien Jul 31 '20 at 19:35
  • @BillO'Brien A Cartesian join? `df1` is 50K rows and `df2` 500K, `50000*500000` is `2.5e+10`. – Rui Barradas Jul 31 '20 at 19:50
  • I stand corrected. – Bill O'Brien Jul 31 '20 at 20:05

1 Answers1

4

Maybe something like the following will do what the question asks for. It uses package stringdist, not fuzzyjoin.

First, merge by Postcode only, since the matches are exact. Then get similarities between Name's. If they are above a predetermined threshold, keep those rows.

thresh <- 0.75

df_final <- merge(df2[c(1, 6:7)], df1[c(1, 6)], by = "Postcode", suffixes = c("",".y"))
i <- apply(df_final[c(2, 4)], 1, function(x) {stringdist::stringsim(x[1], x[2], method = 'jw')}) >= thresh

df_final <- df_final[i, c(2, 1, 3)]

df_final 
#               Name Postcode RatingValue
#1      BETA Limited  BN1 6LD           3
#2 Giga Incorporated   G2 8LY           5
#3              ACME  PA4 8QU           1
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
  • 1
    That seems to have done it. I'll do a bit of cleaning to remove stopwords and adjust the threshold and test different methods, but so far so good! – Nicu Calcea Jul 31 '20 at 20:48