I have a DataFrame with company names:

df:

company_name
abc Inc
abc Inc Bolingbrook
enterprise badh Shah
enterprise Financial
enterprise Financial Shah
bass Dance
bass School of Dance
david Warner
david Warner Real Estate Inc
david Warneranita sampath
Dr anitha sampath
Dranil kumar Gyan prasad
Dranil and kumar Mortgage Corporation
Drbadh Shah
Drvenky Patel
Drs krishna and Rama lingam

I want to standardize company_name so that the output looks like this:

output df:

company_name                            standardized_company_name
abc Inc                                 abc Inc
abc Inc Bolingbrook                     abc Inc
enterprise badh Shah                    enterprise Financial
enterprise Financial                    enterprise Financial
enterprise Financial Shah               enterprise Financial
bass Dance                              bass School of Dance
bass School of Dance                    bass School of Dance
david Warner                            david Warner
david Warner Real Estate Inc            david Warner
david Warneranita sampath               david Warner
Dr anitha sampath                       anitha sampath
Dranil kumar Gyan prasad                anil kumar
Dranil and kumar Mortgage Corporation   anil kumar
Drbadh Shah                             badh Shah
Drvenky Patel                           venky Patel
Drs krishna and Rama lingam             krishna and Rama lingam

NOTE: there are no fixed rules for the standardization, but similar company names must share the same standardized_company_name.

For example, standardized_company_name could also look like this:

company_name                            standardized_company_name
abc Inc                                 abc
abc Inc Bolingbrook                     abc
enterprise badh Shah                    enterprise
enterprise Financial                    enterprise
enterprise Financial Shah               enterprise

I tried removing stopwords with a regex replace, but it was not effective. Thanks in advance.

I also tried splitting each name and keeping only the first word:

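def func(val):
    # Keep only the first space-separated word of the name.
    return val.split(' ', 1)[0]

# .copy() avoids pandas' SettingWithCopyWarning when adding a column to a slice.
name = unique[['company_name', 'state']].copy()
name['standardized_company_name'] = name['company_name'].apply(func)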

but this is the output I got:

company_name                            standardized_company_name
abc Inc                                 abc
abc Inc Bolingbrook                     abc
enterprise badh Shah                    enterprise
enterprise Financial                    enterprise
enterprise Financial Shah               enterprise
bass Dance                              bass
bass School of Dance                    bass
david Warner                            david
david Warner Real Estate Inc            david
david Warneranita sampath               david
Dr anitha sampath                       Dr
Dranil kumar Gyan prasad                Dranil
Dranil and kumar Mortgage Corporation   Dranil
Drbadh Shah                             Drbadh
Drvenky Patel                           Drvenky
Drs krishna and Rama lingam             Drs

1 Answer

First, create a function that keeps at most the first two words of each name:

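def func(val):
    # Split into at most three parts; the third part (the remainder) is discarded.
    parts = val.split(' ', 2)
    # Slicing handles single-word names without an IndexError.
    return ' '.join(parts[:2])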

Then apply it to the column with the apply() method:

df['standardized_company_name']=df['company_name'].apply(func)

Output:

    company_name                     standardized_company_name
0   abc Inc                          abc Inc
1   abc Inc Bolingbrook              abc Inc
2   enterprise badh Shah             enterprise badh
3   enterprise Financial             enterprise Financial
4   enterprise Financial Shah        enterprise Financial
5   bass Dance                       bass Dance
6   bass School of Dance             bass School
7   david Warner                     david Warner
8   david Warner Real Estate Inc     david Warner
9   david Warneranita sampath        david Warneranita

Then, since you mentioned in the comments that similar names must share one standardized name, replace the remaining variants with a single value:

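# regex=False treats the search text literally, whatever the pandas default is.
df['standardized_company_name'] = df['standardized_company_name'].str.replace('badh', 'Financial', regex=False)
df['standardized_company_name'] = df['standardized_company_name'].str.replace('bass Dance', 'bass School', regex=False)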
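The hand-written replacements above only cover the variants you already know about. One possible alternative, offered as a rough sketch rather than a definitive solution, is to group names by their first word and give every name in a group the most common two-word prefix found in that group. The helper names two_word_prefix and build_mapping are hypothetical, and grouping on the first word is an assumption that happens to fit the sample data; it will not hold for every dataset.

```python
from collections import Counter, defaultdict

def two_word_prefix(name):
    # Same idea as func above: keep at most the first two words.
    return ' '.join(name.split()[:2])

def build_mapping(names):
    """Map each name to the most common two-word prefix among
    the names that share its (lowercased) first word."""
    names = list(names)
    groups = defaultdict(Counter)
    for name in names:
        words = name.split()
        if words:
            groups[words[0].lower()][two_word_prefix(name)] += 1

    mapping = {}
    for name in names:
        words = name.split()
        if not words:
            mapping[name] = name  # leave empty strings untouched
            continue
        # most_common(1) returns the prefix with the highest count in the group.
        mapping[name] = groups[words[0].lower()].most_common(1)[0][0]
    return mapping

# Sample names taken from the question.
names = [
    'abc Inc', 'abc Inc Bolingbrook',
    'enterprise badh Shah', 'enterprise Financial', 'enterprise Financial Shah',
    'bass Dance', 'bass School of Dance',
    'david Warner', 'david Warner Real Estate Inc', 'david Warneranita sampath',
]
mapping = build_mapping(names)
```

With pandas, the result could then be applied as df['standardized_company_name'] = df['company_name'].map(build_mapping(df['company_name'])). Every name in a group receives the same value, so enterprise badh Shah ends up as enterprise Financial without a manual replacement.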
Anurag Dabas
  • It has names like Dr abc, Dr bcc, etc. In that case, will it return Dr for all of them? I will update my question. – Mar 12 '21 at 16:55
  • If a company name has only one word, your logic breaks: it raises "list index out of range". – Mar 12 '21 at 17:40
  • Fixed it and updated the answer; kindly have a look. – Anurag Dabas Mar 12 '21 at 17:45
  • But the standardized names should be the same within a group. For example, enterprise should have either enterprise or enterprise Financial as its standardized name, not both; similarly, bass should be either bass or bass School for all records. – Mar 12 '21 at 17:51
  • Fixed this as well and updated the answer; kindly have a look. – Anurag Dabas Mar 12 '21 at 18:03
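The concern raised in the comments about names such as Dr abc could be handled by removing the title before any splitting takes place. The snippet below is a minimal sketch under that assumption; the pattern TITLE_PREFIX and the helper strip_title are hypothetical names, not part of the answer above. Note that the pattern is deliberately naive: it would also trim a legitimate name that merely starts with "Dr" (for example, a company called Dream Homes), so such cases would need an exclusion list.

```python
import re

# Hypothetical pattern: "Drs" followed by whitespace, or "Dr" followed by
# optional whitespace, anchored at the start of the name (case-sensitive).
TITLE_PREFIX = re.compile(r'^(?:Drs\s+|Dr\s*)')

def strip_title(name):
    # Remove a leading Dr/Drs title; names without one are returned unchanged.
    return TITLE_PREFIX.sub('', name)

# Sample names taken from the question.
examples = [
    'Dr anitha sampath',
    'Drbadh Shah',
    'Drvenky Patel',
    'Drs krishna and Rama lingam',
    'david Warner',
]
cleaned = [strip_title(n) for n in examples]
```

In pandas, the same pattern could be applied to the whole column with df['company_name'].str.replace(r'^(?:Drs\s+|Dr\s*)', '', regex=True) before calling apply(func).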