The idea is to use the fuzzy matching library fuzzywuzzy to get the ratio for all combinations of Name values within each ID. The combinations are created by a cross join per ID (a self-merge on ID with DataFrame.merge), rows with the same name in both columns are removed with DataFrame.query, and a new column with the string lengths is added with Series.str.len:
from fuzzywuzzy import fuzz
df1 = df.merge(df, on='ID').query('Name_x != Name_y')
df1['ratio'] = df1.apply(lambda x: fuzz.ratio(x['Name_x'], x['Name_y']), axis=1)
df1['len'] = df1['Name_x'].str.len()
print (df1)
    Name_x   ID   Name_y  ratio  len
1      Abc  123      BCD      0    3
2      BCD  123      Abc      0    3
6      Pqr  789  PQR.com     20    3
7  PQR.com  789      Pqr     20    7
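For readers without fuzzywuzzy installed, the step above can be sketched with the standard library only. This is an illustration under two assumptions: the input frame df is reconstructed from the outputs printed in this answer, and the hypothetical helper simple_ratio uses difflib.SequenceMatcher as a stand-in for fuzz.ratio. It gives the same scores on this sample, but that is not guaranteed in general (for example when fuzzywuzzy runs on the python-Levenshtein backend).

```python
from difflib import SequenceMatcher

import pandas as pd

# Assumed input, reconstructed from the printed outputs in this answer.
df = pd.DataFrame({
    'Name': ['Abc', 'BCD', 'Def', 'Pqr', 'PQR.com'],
    'ID': [123, 123, 345, 789, 789],
})


def simple_ratio(a, b):
    """Hypothetical stand-in for fuzz.ratio: similarity scaled to 0-100."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# Self-merge on ID pairs every name with every other name of the same ID.
pairs = df.merge(df, on='ID').query('Name_x != Name_y').copy()
pairs['ratio'] = [simple_ratio(x, y)
                  for x, y in zip(pairs['Name_x'], pairs['Name_y'])]
pairs['len'] = pairs['Name_x'].str.len()
print(pairs)
```

The list comprehension over zip is used here instead of DataFrame.apply with axis=1 only because it avoids building a row object per pair; both produce the same column.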
Then filter the rows by a threshold with boolean indexing. Next it is necessary to decide which value to keep; one possible solution is to take the longer text. For that, use DataFrameGroupBy.idxmax with DataFrame.loc to select the row with the longest name per ID, and then DataFrame.set_index to get a Series indexed by ID:
N = 15
df2 = df1[df1['ratio'].gt(N)]
s = df2.loc[df2.groupby('ID')['len'].idxmax()].set_index('ID')['Name_x']
print (s)
ID
789 PQR.com
Name: Name_x, dtype: object
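To make the selection step concrete, the sketch below starts from the df1 values printed earlier (entered by hand, so treat the frame as illustrative) and shows that idxmax returns, per ID, the index label of the row with the longest Name_x, which loc then turns back into rows.

```python
import pandas as pd

# df1 as printed above, entered by hand for illustration.
df1 = pd.DataFrame(
    {
        'Name_x': ['Abc', 'BCD', 'Pqr', 'PQR.com'],
        'ID': [123, 123, 789, 789],
        'Name_y': ['BCD', 'Abc', 'PQR.com', 'Pqr'],
        'ratio': [0, 0, 20, 20],
        'len': [3, 3, 3, 7],
    },
    index=[1, 2, 6, 7],
)

N = 15
df2 = df1[df1['ratio'].gt(N)]  # only index labels 6 and 7 survive

# Per ID, the index label of the row with the largest 'len'.
longest = df2.groupby('ID')['len'].idxmax()
print(longest.to_dict())

# Select those rows and build a Series: ID -> chosen name.
s = df2.loc[longest].set_index('ID')['Name_x']
print(s.to_dict())
```

IDs with no pair above the threshold (here 123) are missing from s entirely, which is why the next step falls back to the original names.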
Finally, use Series.map by ID and replace the non-matched values with the original names using Series.fillna:
df['Name'] = df['ID'].map(s).fillna(df['Name'])
print (df)
      Name   ID
0      Abc  123
1      BCD  123
2      Def  345
3  PQR.com  789
4  PQR.com  789
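The map plus fillna combination can also be looked at in isolation. In this sketch both the input frame and the Series s are entered by hand from the outputs shown in this answer; the original df is an assumption reconstructed from them.

```python
import pandas as pd

# Assumed original frame and the Series s from the previous step.
df = pd.DataFrame({
    'Name': ['Abc', 'BCD', 'Def', 'Pqr', 'PQR.com'],
    'ID': [123, 123, 345, 789, 789],
})
s = pd.Series({789: 'PQR.com'}, name='Name_x')

# map looks each ID up in the index of s; IDs not found become NaN.
mapped = df['ID'].map(s)

# fillna restores the original name wherever no replacement was found.
df['Name'] = mapped.fillna(df['Name'])
print(df)
```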
EDIT: If there are several valid strings per ID, it is more complicated:
print (df)
               Name          ID
0      Air Ordnance  1578013421
1  Air-Ordnance.com  1578013421
2          Garreett  1578013421
3           Garrett  1578013421
First get fuzz.ratio as in the solution above:
from fuzzywuzzy import fuzz
df1 = df.merge(df, on='ID').query('Name_x != Name_y')
df1['ratio'] = df1.apply(lambda x: fuzz.ratio(x['Name_x'], x['Name_y']), axis=1)
print (df1)
              Name_x          ID            Name_y  ratio
1       Air Ordnance  1578013421  Air-Ordnance.com     79
2       Air Ordnance  1578013421          Garreett     30
3       Air Ordnance  1578013421           Garrett     32
4   Air-Ordnance.com  1578013421      Air Ordnance     79
6   Air-Ordnance.com  1578013421          Garreett     25
7   Air-Ordnance.com  1578013421           Garrett     26
8           Garreett  1578013421      Air Ordnance     30
9           Garreett  1578013421  Air-Ordnance.com     25
11          Garreett  1578013421           Garrett     93
12           Garrett  1578013421      Air Ordnance     32
13           Garrett  1578013421  Air-Ordnance.com     26
14           Garrett  1578013421          Garreett     93
Then filter by threshold:
N = 50
df2 = df1[df1['ratio'].gt(N)]
print (df2)
              Name_x          ID            Name_y  ratio
1       Air Ordnance  1578013421  Air-Ordnance.com     79
4   Air-Ordnance.com  1578013421      Air Ordnance     79
11          Garreett  1578013421           Garrett     93
14           Garrett  1578013421          Garreett     93
For more precision, it is necessary to specify which strings are valid in a list L and then filter by that list:
L = ['Air-Ordnance.com','Garrett']
df2 = df2.loc[df2['Name_x'].isin(L),['Name_x','Name_y','ID']].rename(columns={'Name_y':'Name'})
print (df2)
              Name_x          Name          ID
4   Air-Ordnance.com  Air Ordnance  1578013421
14           Garrett      Garreett  1578013421
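After this filter, df2 works as a lookup table from a variant spelling (Name) to its accepted spelling (Name_x). A small check of that reading, using the pairs above the threshold entered by hand from the printed output:

```python
import pandas as pd

# Pairs above the threshold, as printed above (entered by hand).
df2 = pd.DataFrame(
    {
        'Name_x': ['Air Ordnance', 'Air-Ordnance.com', 'Garreett', 'Garrett'],
        'ID': [1578013421] * 4,
        'Name_y': ['Air-Ordnance.com', 'Air Ordnance', 'Garrett', 'Garreett'],
        'ratio': [79, 79, 93, 93],
    },
    index=[1, 4, 11, 14],
)

L = ['Air-Ordnance.com', 'Garrett']

# Keep rows whose Name_x is an accepted spelling; Name_y is then the variant.
lookup = (df2.loc[df2['Name_x'].isin(L), ['Name_x', 'Name_y', 'ID']]
             .rename(columns={'Name_y': 'Name'}))
variant_to_valid = dict(zip(lookup['Name'], lookup['Name_x']))
print(variant_to_valid)
```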
Finally, merge with a left join to the original and replace the missing values:
df = df.merge(df2, on=['Name','ID'], how='left')
df['Name'] = df.pop('Name_x').fillna(df['Name'])
print (df)
               Name          ID
0  Air-Ordnance.com  1578013421
1  Air-Ordnance.com  1578013421
2           Garrett  1578013421
3           Garrett  1578013421
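Putting the EDIT steps together, a rough end-to-end sketch might look like the following. The function name normalize_names, its parameters, and the difflib-based simple_ratio stand-in for fuzz.ratio are assumptions made for this illustration; swap in fuzz.ratio if fuzzywuzzy is installed.

```python
from difflib import SequenceMatcher

import pandas as pd


def simple_ratio(a, b):
    """Hypothetical stand-in for fuzz.ratio: similarity scaled to 0-100."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def normalize_names(df, valid, threshold=50):
    """Replace each Name by a similar accepted spelling sharing its ID."""
    # All pairs of different names within the same ID.
    pairs = df.merge(df, on='ID').query('Name_x != Name_y').copy()
    pairs['ratio'] = [simple_ratio(x, y)
                      for x, y in zip(pairs['Name_x'], pairs['Name_y'])]
    # Keep similar pairs whose left side is an accepted spelling.
    keep = pairs['ratio'].gt(threshold) & pairs['Name_x'].isin(valid)
    lookup = (pairs.loc[keep, ['Name_x', 'Name_y', 'ID']]
                   .rename(columns={'Name_y': 'Name'}))
    # Left join back; rows without a match keep their original name.
    out = df.merge(lookup, on=['Name', 'ID'], how='left')
    out['Name'] = out.pop('Name_x').fillna(out['Name'])
    return out


df = pd.DataFrame({
    'Name': ['Air Ordnance', 'Air-Ordnance.com', 'Garreett', 'Garrett'],
    'ID': [1578013421] * 4,
})
result = normalize_names(df, valid=['Air-Ordnance.com', 'Garrett'])
print(result)
```

One caveat of this sketch: if a variant is similar to two accepted spellings under the same ID, the left join returns that row twice, so the list of valid names and the threshold should be chosen so that each variant has at most one match.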