2

I have a dataset containing a list of company names, and a respective ID for them. There are multiple instances of each company, with some appearing differently. There is at least one instance of each company name that has an ID, but not all of them due to inconsistencies in the spellings. All of the companies are grouped together. The data looks something like this:

company_name                 id

T. Rowe Price Group
Group, T. Rowe Price         576
T. ROWE PRICE GROUP
Transatlantic, Inc           458
Transatlantic, Incorporated
Transatlantic, Inc           458

Is there a good method to match company names with missing ID's to their correct one?

mkh
  • 85
  • 3
  • Hi, I see you're new to SO. If you feel an answer solved the problem, please mark it as 'accepted' by clicking the green check mark. This helps keep the focus on older SO which still don't have answers. – Sameeresque Feb 07 '20 at 04:53

2 Answers2

1

Here's one way to do it using pandas:

import pandas as pd
import numpy as np
import re
from collections import OrderedDict
# a function that splits a string into text and number
def my_splitter(s):
    return filter(None, re.split(r'(\d+)', s))
#reading the data as a dataframe from the file
df=pd.read_csv('dataset.txt',sep='\t',header=None,skiprows=1,names=['Name'])
join=[]
for i in range(len(df)):
    if len(my_splitter(df['Name'][i]))!=2:
        join.append({'Name': my_splitter(df['Name'][i])[0], 'ID': 'na'})
    else:
        join.append({'Name': my_splitter(df['Name'][i])[0], 'ID': my_splitter(df['Name'][i])[1]})
df_new=pd.DataFrame(join) 

diction=OrderedDict()
#creating a dictionary that stores the company name and ID
for i in range(len(df_new)):
    if df_new['ID'][i]!='na':
        diction[df_new['ID'][i]]=df_new['Name'][i].split()

for i in range(len(df_new)):
    if df_new['ID'][i]=='na':
        for j in diction:
            if bool(set(df_new['Name'][i].split()) & set(diction[j])):
                df_new['ID'][i]=j

print (df) # contents of the testing file read as a dataframe
print ("####################")
print (df_new)
#save result to a file - dataset.txt
df_new.to_csv('dataset.txt', sep='\t')

Output:

                              Name
0               T. Rowe Price Group
1  Group, T. Rowe Price         576
2               T. ROWE PRICE GROUP
3  Transatlantic, Inc           458
4       Transatlantic, Incorporated
5  Transatlantic, Inc           458
####################
    ID                           Name
0  576            T. Rowe Price Group
1  576  Group, T. Rowe Price         
2  576            T. ROWE PRICE GROUP
3  458  Transatlantic, Inc           
4  458    Transatlantic, Incorporated
5  458  Transatlantic, Inc   
Sameeresque
  • 2,464
  • 1
  • 9
  • 22
0

Using NLTK you can transform the company_names to their root (look up stemming and lemmatization example from here https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) then you can give same id for same company.

Souha Gaaloul
  • 328
  • 4
  • 9