Identifying near duplicate keywords and replacing them

Question

I have a dataframe like as shown below

ID,Name,year,output
1,Test Level,2021,1
2,Test Lvele,2022,1
2,dummy Inc,2022,1
2,dummy Pvt Inc,2022,1
3,dasho Ltd,2022,1
4,dasho PVT Ltd,2021,0
5,delphi Ltd,2021,1
6,delphi pvt ltd,2021,1

df = pd.read_clipboard(sep=',')

My objective is

a) To replace near duplicate strings using a common string.

For example - let's pick couple of strings from Name column. We have dummy Inc and dummy Pvt Inc. These both have to be replaced as dummy

I manually prepared a mapping df map_df like as below (but can't do this for big data)

  Name,correct_name
  Test Level,Test
  Test Lvele,Test
  dummy Inc,dummy
  dummy Pvt Inc,dummy
  dasho Ltd,dasho
  dasho PVT Ltd,dasho
  delphi Ltd,delphi
  delphi pvt ltd,delphi

So, I tried the below

map_df = map_df.set_index(Name)
df['Name'] = df['Name'].map(map_df) # but this doesn't work and throws error

Is creating mapping table the only way or is there any NLP based approach?

I expect my output to be like as below

ID,Name,year,output
1,Test,2021,1
2,Test,2022,1
2,dummy,2022,1
2,dummy,2022,1
3,dasho,2022,1
4,dasho,2021,0
5,delphi,2021,1
6,delphi,2021,1

score 2 · Answer 1 · edited May 31 '22 at 06:03

2

You can try this:

df['Name'] = [name.split()[0] for name in df['Name']]

edited May 31 '22 at 06:03

Sunderam Dubey

1
11
20
40

answered May 11 '22 at 07:24

William Rosenbaum

255
2
7

can you guide me on how can I apply the map function as shown in my code? upvoted for the help – The Great May 11 '22 at 08:22
Sorry, I can't come up with anything good using map. – William Rosenbaum May 11 '22 at 08:50

Stef · Accepted Answer · 2022-05-12T09:09:32.697

I suggest using a dict instead of a pandas.DataFrame for map_df.

ID,Name,year,output
1,Test Level,2021,1
2,Test Lvele,2022,1
2,dummy Inc,2022,1
2,dummy Pvt Inc,2022,1
3,dasho Ltd,2022,1
4,dasho PVT Ltd,2021,0
5,delphi Ltd,2021,1
6,delphi pvt ltd,2021,1

df = pd.read_clipboard(sep=',')

map_dict = dict(s.strip().split(',') for s in '''  Test Level,Test
  Test Lvele,Test
  dummy Inc,dummy
  dummy Pvt Inc,dummy
  dasho Ltd,dasho
  dasho PVT Ltd,dasho
  delphi Ltd,delphi
  delphi pvt ltd,delphi'''.split('\n'))

df['Name'] = df['Name'].map(map_dict.get)

Results:

df.to_clipboard(sep=',')

,ID,Name,year,output
0,1,Test,2021,1
1,2,Test,2022,1
2,2,dummy,2022,1
3,2,dummy,2022,1
4,3,dasho,2022,1
5,4,dasho,2021,0
6,5,delphi,2021,1
7,6,delphi,2021,1

If map_df is already a DataFrame with two columns and you want to turn these two columns into a dict, this related question: How to create a dictionary of two pandas DataFrame columns? suggests a few methods:

map_dict = dict(zip(map_df['Name'], map_df['correct_name']))

map_dict = pd.Series(map_df['correct_name'].values,index=map_df['Name']).to_dict()

map_dict = map_df.set_index('Name').to_dict()['correct_name']

map_dict = dict(map_df.to_records(index=False))

Identifying near duplicate keywords and replacing them

2 Answers2