I have an issue joining two datasets with companies that have inconsistent names. In the first dataset, I have event data which consist of Company Name, Company ID, etc. In the second dataset, I have all the information about the company, including the Company Name and Company ID.
Example of the issue:
Event Data example:
Company Name | Company ID |
---|---|
NokiaCorporation | |
Apple, INC. | 1234567899 |
Company Data (the "golden" data) example:
Company Name | Company ID |
---|---|
Nokia Oyj | 2222222222 |
Apple INC | 1111111111 |
As one can see from the Event Data example, in the second row, the company is Nokia but written "incorrectly" and has black cell in the Company ID due to this. Also there are some cases where the name is also correct, but not quite as in the third row for Apple case. In addition, the ID might be different due to this.
My question is:
- How to tackle this issue? Are there any algorithms processors I can utilize?
I have utilized fuzzy matching method in Excel. However, I have over 100k data to check. With similarity rate of 65%, it gives over 35k rows to manually check. With 85% less than 4k observations. The aim is to have achieve high matching rate efficiently with the least amount of manual work.
And no, I don't have any other data that I can match other than the Company Name at the moment.
Best scenario: To match all the data using Company Name.
Edit: Background: I'm studying the impact of patent litigation on firm value by conducting an event study. The scope is to study all public firms.
In the event data, in the plaintiff/defendant side, there might be subsidiary companies. For example: Company Name: Nokia Technologies (subsidiary of Nokia Oyj) and different ID than Nokia Oyj. The optimal scenario would be to match these as well since subsidiaries' patent litigation's effect flows to the parent company (who can be public).
Edit 2x:
Extended view on the event data:
Plaintiff Name | Plaintiff ID |
---|---|
NokiaCorporation | XXXXXXXXXXXXXX |
Nokia Technologies | XXXXXXXXXXXXXX |
Apple, INC. | XXXXXXXXXXXXXX |
Nokia Technologies | XXXXXXXXXXXXXX |
Google, Inc. | XXXXXXXXXXXXXX |
AT & T Corp. | XXXXXXXXXXXXXX |
Qualcomm Incorporated | XXXXXXXXXXXXXX |
eBay, Inc. | XXXXXXXXXXXXXX |
Dell, inc. | XXXXXXXXXXXXXX |
3M Innovation Limited | XXXXXXXXXXXXXX |
Extended view on the "golden" data:
Company Name | Company ID |
---|---|
Nokia Oyj | XXXXXXXXXXXXXX |
Alphabet INC. | XXXXXXXXXXXXXX |
Apple INC. | XXXXXXXXXXXXXX |
AT&T INC. | XXXXXXXXXXXXXX |
Dell Technologies INC. | XXXXXXXXXXXXXX |
AT & T Corp. | XXXXXXXXXXXXXX |
Qualcomm INC | XXXXXXXXXXXXXX |
EBAY Inc | XXXXXXXXXXXXXX |
3M Company | XXXXXXXXXXXXXX |
Expected data after fuzzy matching or alternative matching process:
Plaintiff Name | Company Name |
---|---|
NokiaCorporation | Nokia Oyj |
Nokia Technologies | Nokia Oyj |
Apple, INC. | Apple INC. |
Nokia Technologies | Nokia Oyj |
Google, Inc. | Alphabet INC. |
AT & T Corp. | AT&T INC. |
Qualcomm Incorporated | Qualcomm INC |
eBay, Inc. | EBAY Inc |
Dell, inc. | Dell Technologies INC. |
3M Innovation Limited | 3M Company |
The first column is from the event data and the second column is from the "golden" data.