0

I have an issue joining two datasets with companies that have inconsistent names. In the first dataset, I have event data which consist of Company Name, Company ID, etc. In the second dataset, I have all the information about the company, including the Company Name and Company ID.

Example of the issue:

Event Data example:

Company Name Company ID
NokiaCorporation
Apple, INC. 1234567899

Company Data (the "golden" data) example:

Company Name Company ID
Nokia Oyj 2222222222
Apple INC 1111111111

As one can see from the Event Data example, in the second row, the company is Nokia but written "incorrectly" and has black cell in the Company ID due to this. Also there are some cases where the name is also correct, but not quite as in the third row for Apple case. In addition, the ID might be different due to this.

My question is:

  • How to tackle this issue? Are there any algorithms processors I can utilize?

I have utilized fuzzy matching method in Excel. However, I have over 100k data to check. With similarity rate of 65%, it gives over 35k rows to manually check. With 85% less than 4k observations. The aim is to have achieve high matching rate efficiently with the least amount of manual work.

And no, I don't have any other data that I can match other than the Company Name at the moment.

Best scenario: To match all the data using Company Name.

Edit: Background: I'm studying the impact of patent litigation on firm value by conducting an event study. The scope is to study all public firms.

In the event data, in the plaintiff/defendant side, there might be subsidiary companies. For example: Company Name: Nokia Technologies (subsidiary of Nokia Oyj) and different ID than Nokia Oyj. The optimal scenario would be to match these as well since subsidiaries' patent litigation's effect flows to the parent company (who can be public).

Edit 2x:

Extended view on the event data:

Plaintiff Name Plaintiff ID
NokiaCorporation XXXXXXXXXXXXXX
Nokia Technologies XXXXXXXXXXXXXX
Apple, INC. XXXXXXXXXXXXXX
Nokia Technologies XXXXXXXXXXXXXX
Google, Inc. XXXXXXXXXXXXXX
AT & T Corp. XXXXXXXXXXXXXX
Qualcomm Incorporated XXXXXXXXXXXXXX
eBay, Inc. XXXXXXXXXXXXXX
Dell, inc. XXXXXXXXXXXXXX
3M Innovation Limited XXXXXXXXXXXXXX

Extended view on the "golden" data:

Company Name Company ID
Nokia Oyj XXXXXXXXXXXXXX
Alphabet INC. XXXXXXXXXXXXXX
Apple INC. XXXXXXXXXXXXXX
AT&T INC. XXXXXXXXXXXXXX
Dell Technologies INC. XXXXXXXXXXXXXX
AT & T Corp. XXXXXXXXXXXXXX
Qualcomm INC XXXXXXXXXXXXXX
EBAY Inc XXXXXXXXXXXXXX
3M Company XXXXXXXXXXXXXX

Expected data after fuzzy matching or alternative matching process:

Plaintiff Name Company Name
NokiaCorporation Nokia Oyj
Nokia Technologies Nokia Oyj
Apple, INC. Apple INC.
Nokia Technologies Nokia Oyj
Google, Inc. Alphabet INC.
AT & T Corp. AT&T INC.
Qualcomm Incorporated Qualcomm INC
eBay, Inc. EBAY Inc
Dell, inc. Dell Technologies INC.
3M Innovation Limited 3M Company

The first column is from the event data and the second column is from the "golden" data.

Hao Pan
  • 1
  • 1

2 Answers2

1

A base R approach using adist and merge. This is guaranteed to give a match but fails when companies are very similar or are subbranches like "Apple Silicon", "Apple Broadcasting" etc.

merge(df1, 
  cbind(df2, adist = df1$`Company Name`[sapply(df1$`Company Name`, function(x) 
    which.min(adist(x, df2$`Company Name`, partial = T)))]), 
  by.x = "Company Name", by.y = "adist", all = T)
      Company Name Company ID.x Company Name.y Company ID.y
1      Apple, INC.   1234567899      Apple INC   1111111111
2 NokiaCorporation           NA      Nokia Oyj   2222222222

Data

df1 <- structure(list(`Company Name` = c("NokiaCorporation", "Apple, INC."
), `Company ID` = c(NA, 1234567899L)), row.names = c(NA, -2L), 
class = "data.frame")

df2 <- structure(list(`Company Name` = c("Nokia Oyj", "Apple INC"), 
    `Company ID` = c(2222222222, 1111111111)), 
class = "data.frame", row.names = c(NA, 
-2L))

--EDIT-- using new data

First using fuzzyjoins stringdist_left_join on unmodified df1 and df2

Matches: 4/10

library(fuzzyjoin)

stringdist_left_join(df1, df2, by=c("Plaintiff Name" = "Company Name"), 
  ignore_case=T, method = "jw", max_dist = .2)[,c("Plaintiff Name", "Company Name")]
          Plaintiff Name Company Name
1       NokiaCorporation         <NA>
2     Nokia Technologies         <NA>
3            Apple, INC.   Apple INC.
4     Nokia Technologies         <NA>
5           Google, Inc.         <NA>
6           AT & T Corp. AT & T Corp.
7  Qualcomm Incorporated Qualcomm INC
8             eBay, Inc.     EBAY Inc
9             Dell, inc.         <NA>
10 3M Innovation Limited         <NA>

Second using stringdist_left_join on modified df1 df2.

Modification means removing long redundant generic strings like "Technologies" and "Company" from both sets and putting the results in a temporary column Ctemp.

Matches: 7/10

library(fuzzyjoin)

df1$Ctemp <- gsub("Technologies|Company|Limited", "", df1$`Plaintiff Name`)
df2$Ctemp <- gsub("Technologies|Company|Limited", "", df2$`Company Name`)

stringdist_left_join(df1, df2, by="Ctemp", ignore_case=T, method = "jw", 
  max_dist = .2)[,c("Plaintiff Name", "Company Name")]
          Plaintiff Name           Company Name
1       NokiaCorporation                   <NA>
2     Nokia Technologies              Nokia Oyj
3            Apple, INC.             Apple INC.
4     Nokia Technologies              Nokia Oyj
5           Google, Inc.                   <NA>
6           AT & T Corp.           AT & T Corp.
7  Qualcomm Incorporated           Qualcomm INC
8             eBay, Inc.               EBAY Inc
9             Dell, inc. Dell Technologies INC.
10 3M Innovation Limited                   <NA>

Third, using adist and a cost list of list(insertions=8, deletions=11, substitutions=7)

Matches: 9/10 (one mismatch, and "Google, Inc." | "Alphabet INC." match because of "Inc." and length of prefixes)

df1$Ctemp <- gsub("Technologies|Company|Limited", "", df1$`Plaintiff Name`)
df2$Ctemp <- gsub("Technologies|Company|Limited", "", df2$`Company Name`)

df1$adist <- sapply(df1$Ctemp, function(plaint) 
  df2$Ctemp[which.min(adist(plaint, df2$Ctemp, partial = T, 
    ignore.case = T, costs=list(ins=8, del=11, sub=7)))])

merge(df1, df2, by.x="adist", by.y="Ctemp")[, c("Plaintiff Name", "Company Name")]
          Plaintiff Name           Company Name
1           Google, Inc.          Alphabet INC.
2            Apple, INC.             Apple INC.
3           AT & T Corp.           AT & T Corp.
4             Dell, inc. Dell Technologies INC.
5             eBay, Inc.               EBAY Inc
6       NokiaCorporation              Nokia Oyj
7     Nokia Technologies              Nokia Oyj
8     Nokia Technologies              Nokia Oyj
9  Qualcomm Incorporated           Qualcomm INC
10 3M Innovation Limited           Qualcomm INC

Conclusion:

stringdist_left_join can be tweaked to deliver good results by choosing an appropriate method, e.g. jw with max_dist 0.2 and reducing the redundancy within the search strings.

adist can be tweaked further by fine tuning costs insertions, deletions and substitutions. Always matches but can include mismatches.

extended data

df1 <- structure(list(`Plaintiff Name` = c("NokiaCorporation", 
"Nokia Technologies", "Apple, INC.", "Nokia Technologies", 
"Google, Inc.", "AT & T Corp.", 
"Qualcomm Incorporated", "eBay, Inc.", "Dell, inc.", 
"3M Innovation Limited"
), `Plaintiff ID` = c("XXXXXXXXXXXXXX", "XXXXXXXXXXXXXX", "XXXXXXXXXXXXXX", 
"XXXXXXXXXXXXXX", "XXXXXXXXXXXXXX", "XXXXXXXXXXXXXX", "XXXXXXXXXXXXXX", 
"XXXXXXXXXXXXXX", "XXXXXXXXXXXXXX", "XXXXXXXXXXXXXX")), 
class = "data.frame", row.names = c(NA, -10L))

df2 <- structure(list(`Company Name` = c("Nokia Oyj", "Alphabet INC.", 
"Apple INC.", "AT&T INC.", "Dell Technologies INC.", "AT & T Corp.", 
"Qualcomm INC", "EBAY Inc", "3M Company"), `Company ID` = c("XXXXXXXXXXXXXX", 
"XXXXXXXXXXXXXX", "XXXXXXXXXXXXXX", "XXXXXXXXXXXXXX", "XXXXXXXXXXXXXX", 
"XXXXXXXXXXXXXX", "XXXXXXXXXXXXXX", "XXXXXXXXXXXXXX", "XXXXXXXXXXXXXX"
)), class = "data.frame", row.names = c(NA, -9L))
Andre Wildberg
  • 12,344
  • 3
  • 12
  • 29
  • Thank you! There might be subsidiaries with similar name than, for example, Apple which still need to be matched. To clarity, I need to match all the companies or its global ultimate owner who are public or have been public. Thus, if the companies global ultimate owner is public, then the company should be matched. – Hao Pan Jan 31 '23 at 21:50
  • @HaoPan Hmm, maybe it would be best to extend your example through edit to show that scenario. Otherwise it's gonna be difficult to come up with a solution. – Andre Wildberg Jan 31 '23 at 22:03
  • I've extended my example and given a little bit of background. – Hao Pan Jan 31 '23 at 22:27
  • @HaoPan Thanks, but thats not enough info. Ideally you'd show a `dput(head(data, 10))` or an extended table view of the data and an expected output. Especially in a fuzzy match scenario its quite important to have as much data as possible (but still minimal) to test on, otherwise the given solution will not work or perform poorly. – Andre Wildberg Jan 31 '23 at 22:41
  • Edited an extended table view and an expected output. – Hao Pan Feb 01 '23 at 08:49
  • @HaoPan Added an overview of possibilities. Should give you a good start to get reasonable results. – Andre Wildberg Feb 01 '23 at 15:29
  • 1
    Thank you so much for your effort and input! – Hao Pan Feb 01 '23 at 22:33
0

You can use {powerjoin} with {stringdist} :

powerjoin::power_left_join(
  df1, df2, 
  by = ~ stringdist::stringdist(.x$`Plaintiff Name`, .y$`Company Name`) < 7
)
#>           Plaintiff Name   Plaintiff ID  Company Name     Company ID
#> 1       NokiaCorporation XXXXXXXXXXXXXX          <NA>           <NA>
#> 2     Nokia Technologies XXXXXXXXXXXXXX          <NA>           <NA>
#> 3            Apple, INC. XXXXXXXXXXXXXX Alphabet INC. XXXXXXXXXXXXXX
#> 4            Apple, INC. XXXXXXXXXXXXXX    Apple INC. XXXXXXXXXXXXXX
#> 5            Apple, INC. XXXXXXXXXXXXXX     AT&T INC. XXXXXXXXXXXXXX
#> 6     Nokia Technologies XXXXXXXXXXXXXX          <NA>           <NA>
#> 7           Google, Inc. XXXXXXXXXXXXXX          <NA>           <NA>
#> 8           AT & T Corp. XXXXXXXXXXXXXX     AT&T INC. XXXXXXXXXXXXXX
#> 9           AT & T Corp. XXXXXXXXXXXXXX  AT & T Corp. XXXXXXXXXXXXXX
#> 10 Qualcomm Incorporated XXXXXXXXXXXXXX          <NA>           <NA>
#> 11            eBay, Inc. XXXXXXXXXXXXXX      EBAY Inc XXXXXXXXXXXXXX
#> 12            Dell, inc. XXXXXXXXXXXXXX          <NA>           <NA>
#> 13 3M Innovation Limited XXXXXXXXXXXXXX          <NA>           <NA>

You might also capture the distance to disambiguate further, or play with the parameters of the stringdist() function :

powerjoin::power_left_join(
  df1, df2, 
  by = ~ (dist <- stringdist::stringdist(.x$`Plaintiff Name`, .y$`Company Name`)) < 8
)
#>           Plaintiff Name   Plaintiff ID  Company Name     Company ID dist
#> 1       NokiaCorporation XXXXXXXXXXXXXX          <NA>           <NA>   NA
#> 2     Nokia Technologies XXXXXXXXXXXXXX          <NA>           <NA>   NA
#> 3            Apple, INC. XXXXXXXXXXXXXX Alphabet INC. XXXXXXXXXXXXXX    5
#> 4            Apple, INC. XXXXXXXXXXXXXX    Apple INC. XXXXXXXXXXXXXX    1
#> 5            Apple, INC. XXXXXXXXXXXXXX     AT&T INC. XXXXXXXXXXXXXX    5
#> 6     Nokia Technologies XXXXXXXXXXXXXX          <NA>           <NA>   NA
#> 7           Google, Inc. XXXXXXXXXXXXXX    Apple INC. XXXXXXXXXXXXXX    7
#> 8           AT & T Corp. XXXXXXXXXXXXXX     AT&T INC. XXXXXXXXXXXXXX    6
#> 9           AT & T Corp. XXXXXXXXXXXXXX  AT & T Corp. XXXXXXXXXXXXXX    0
#> 10 Qualcomm Incorporated XXXXXXXXXXXXXX          <NA>           <NA>   NA
#> 11            eBay, Inc. XXXXXXXXXXXXXX    Apple INC. XXXXXXXXXXXXXX    7
#> 12            eBay, Inc. XXXXXXXXXXXXXX     AT&T INC. XXXXXXXXXXXXXX    7
#> 13            eBay, Inc. XXXXXXXXXXXXXX      EBAY Inc XXXXXXXXXXXXXX    5
#> 14            Dell, inc. XXXXXXXXXXXXXX    Apple INC. XXXXXXXXXXXXXX    7
#> 15            Dell, inc. XXXXXXXXXXXXXX      EBAY Inc XXXXXXXXXXXXXX    7
#> 16 3M Innovation Limited XXXXXXXXXXXXXX          <NA>           <NA>   NA

Created on 2023-03-16 with reprex v2.0.2

moodymudskipper
  • 46,417
  • 11
  • 121
  • 167