3

I need to join two datasets and the only identifier in both are the company names. For example:

db1 <- tibble(
  Company = c('Bombardier Inc.','Honeywell Development Corp','The Pepsi Bottling Group (Canada), Ulc (“Pbgc”)','PepsiCo Canada ULC'),
  var1 = 1:4
)

db2 <- tibble(
  Name = c('Bombardier Inc.','Honeywell Dev Corp','The Pepsi Bottling Group (Canada), ULC','PepsiCo Canada ULC (“Pcu”)'),
  var2 = 6:9
)

Obviously a straightforward dplyr::left_join() is not going to work. I tried the following which didn't work:

fuzzyjoin::regex_left_join(db1,db2,by=c('Company'='Name'))
# A tibble: 4 x 4
  Company                                          var1 Name             var2
  <chr>                                           <int> <chr>           <int>
1 Bombardier Inc.                                     1 Bombardier Inc.     6
2 Honeywell Development Corp                          2 NA                 NA
3 The Pepsi Bottling Group (Canada), Ulc (“Pbgc”)     3 NA                 NA
4 PepsiCo Canada ULC                                  4 NA                 NA

I made some progress by removing "nonessential" characters from the names:

db1 <- db1 %>% mutate(Company.alt = str_remove_all(Company,regex(
  'The|Canada|Inc|Ltd|Company|\\bCo\\b|Corporation|Corp|Group|ULC|[:punct:]',
  ignore_case = T
)) %>% str_squish())

db2 <- db2 %>% mutate(Name.alt = str_remove_all(Name,regex(
  'The|Canada|Inc|Ltd|Company|\\bCo\\b|Corporation|Corp|Group|ULC|[:punct:]',
  ignore_case = T
)) %>% str_squish())

fuzzyjoin::regex_left_join(db1,db2,by=c('Company.alt'='Name.alt'))
# A tibble: 4 x 6
  Company                                          var1 Company.alt           Name            var2 Name.alt 
  <chr>                                           <int> <chr>                 <chr>          <int> <chr>    
1 Bombardier Inc.                                     1 Bombardier            Bombardier In~     6 Bombardi~
2 Honeywell Development Corp                          2 Honeywell Development Honeywell Dev~     7 Honeywel~
3 The Pepsi Bottling Group (Canada), Ulc (“Pbgc”)     3 Pepsi Bottling Pbgc   The Pepsi Bot~     8 Pepsi Bo~
4 PepsiCo Canada ULC                                  4 PepsiCo               NA                NA NA      

But this still left the last row unmatched. For greater clarity, the last row of Company.alt is PepsiCo which is not considered a fuzzy match with Name.alt's last row of PepsiCo Pcu.

Is there way to successfully left-join the two datasets?

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
msn
  • 113
  • 3

2 Answers2

4

Try this:

We could join db1 db2 based on fuzzy string matching of their columns.

With max_dist we could define the Maximum distance to use for joining

See: ?stringdist_left_join

library(dplyr)
library(fuzzyjoin)

fuzzyjoin::stringdist_left_join(x=db1, y=db2, max_dist = .35, 
                                by=c('Company'='Name'), 
                                method = 'jaccard', 
                                distance_col = "dist")
  Company                                          var1 Name                                    var2  dist
  <chr>                                           <int> <chr>                                  <int> <dbl>
1 Bombardier Inc.                                     1 Bombardier Inc.                            6 0    
2 Honeywell Development Corp                          2 Honeywell Dev Corp                         7 0.133
3 The Pepsi Bottling Group (Canada), Ulc (“Pbgc”)     3 The Pepsi Bottling Group (Canada), ULC     8 0.172
4 PepsiCo Canada ULC                                  4 PepsiCo Canada ULC (“Pcu”)                 9 0.316
TarJae
  • 72,363
  • 6
  • 19
  • 66
4

1) phonics The phonics package has numerous methods for approximate matching such as soundex. See the package documentation for other methods.

library(dplyr)
library(phonics)

db1s <- mutate(db1, s = soundex(Company, clean = FALSE))
db2s <- mutate(db2, s = soundex(Name, clean = FALSE))
left_join(db1s, db2s)

giving:

Joining, by = "s"
# A tibble: 4 x 5
  Company                                          var1 s     Name          var2
  <chr>                                           <int> <chr> <chr>        <int>
1 Bombardier Inc.                                     1 B516  Bombardier ~     6
2 Honeywell Development Corp                          2 H543  Honeywell D~     7
3 The Pepsi Bottling Group (Canada), Ulc (“Pbgc”)     3 T112  The Pepsi B~     8
4 PepsiCo Canada ULC                                  4 P122  PepsiCo Can~     9

2) SQLite SQLite has a built-in soundex function.

library(sqldf)

sqldf("select *
  from db1
  left join db2 on soundex(Company) = soundex(Name)")

giving:

                                          Company var1                                   Name var2
1                                 Bombardier Inc.    1                        Bombardier Inc.    6
2                      Honeywell Development Corp    2                     Honeywell Dev Corp    7
3 The Pepsi Bottling Group (Canada), Ulc (“Pbgc”)    3 The Pepsi Bottling Group (Canada), ULC    8
4                              PepsiCo Canada ULC    4             PepsiCo Canada ULC (“Pcu”)    9
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • This is cool! Would increasing the `maxCodeLen` improve match precision, or is there an optimal `maxCodeLen` for matching? – msn Apr 28 '22 at 18:20
  • Presumably that is the case. Practically one would have to use trial and error with the various phonics algorithms and arguments to find which one(s) work best with your particular dataset. – G. Grothendieck Apr 29 '22 at 11:28