I have a data frame that has a name and the URL ID of the name. For example:

Abc           123
Abc.com       123
Def           345
Pqr           123
PQR.com       123

Here, due to a data extraction error, different names sometimes end up with the same ID. I want to clean the table so that if the names are different and the ID is the same, the records stay as they are, but if the names are similar and the ID is also the same, they are changed to one name. To be clear,

The expected output should be:

Abc.com     123
Abc.com     123
Def         345
PQR.com     123
PQR.com     123

That is, the last two rows were a data entry error: both are really the same name (the first word of the string is the same), so they are changed to one name based on the ID. The first and second rows, even though they have the same ID as the last ones, have completely different names from them, so they are not merged with those.

I am not able to figure out how to achieve this.

Any guidance would be appreciated. Thanks in advance.

Note: The dataset contains almost 16 million such records.

asspsss

1 Answer

The idea is to use the fuzzy matching library fuzzywuzzy to compute a ratio for all combinations of names within each ID: create the combinations with a self-merge (a cross join per ID) using DataFrame.merge, remove rows with the same name in both columns using DataFrame.query, and add a new column with the string lengths using Series.str.len:

from fuzzywuzzy import fuzz

df1 = df.merge(df, on='ID').query('Name_x != Name_y')
df1['ratio'] = df1.apply(lambda x:  fuzz.ratio(x['Name_x'], x['Name_y']), axis=1)
df1['len'] = df1['Name_x'].str.len()
print (df1)
    Name_x   ID   Name_y  ratio  len
1      Abc  123      BCD      0    3
2      BCD  123      Abc      0    3
6      Pqr  789  PQR.com     20    3
7  PQR.com  789      Pqr     20    7
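For reference, the sample frame that produces the output above is not shown in the original answer; reconstructed from the printouts it appears to be roughly the following (treat it as an assumption):

import pandas as pd

df = pd.DataFrame({'Name': ['Abc', 'BCD', 'Def', 'Pqr', 'PQR.com'],
                   'ID': [123, 123, 345, 789, 789]})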

Then filter the rows by a threshold with boolean indexing. Next it is necessary to choose which value to keep; one possible solution is to take the longer text, so use DataFrameGroupBy.idxmax with DataFrame.loc and then DataFrame.set_index to get a Series:

N = 15    
df2 = df1[df1['ratio'].gt(N)]
s = df2.loc[df2.groupby('ID')['len'].idxmax()].set_index('ID')['Name_x']
print (s)
ID
789    PQR.com
Name: Name_x, dtype: object

Finally, map by ID with Series.map and replace the non-matched values with the originals using Series.fillna:

df['Name'] = df['ID'].map(s).fillna(df['Name'])
print (df)
      Name   ID
0      Abc  123
1      BCD  123
2      Def  345
3  PQR.com  789
4  PQR.com  789
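Putting the three steps together, the whole procedure can be wrapped in one small helper. This is only a convenience sketch, not part of the original answer: the canonicalize name and the threshold parameter are invented here, and like the steps above it assumes at most one group of similar names per ID (see the EDIT below for the harder case):

from fuzzywuzzy import fuzz

def canonicalize(df, threshold=15):
    # all pairs of names sharing an ID, without self-pairs
    pairs = df.merge(df, on='ID').query('Name_x != Name_y')
    if pairs.empty:
        return df
    pairs['ratio'] = pairs.apply(lambda x: fuzz.ratio(x['Name_x'], x['Name_y']), axis=1)
    pairs['len'] = pairs['Name_x'].str.len()
    # keep only sufficiently similar pairs, then pick the longest name per ID
    pairs = pairs[pairs['ratio'].gt(threshold)]
    if pairs.empty:
        return df
    s = pairs.loc[pairs.groupby('ID')['len'].idxmax()].set_index('ID')['Name_x']
    out = df.copy()
    out['Name'] = out['ID'].map(s).fillna(out['Name'])
    return out

Calling canonicalize(df) on the sample frame above should reproduce the result printed above.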

EDIT: If there is more than one valid string per ID, it is more complicated:

print (df)
               Name          ID
0      Air Ordnance  1578013421
1  Air-Ordnance.com  1578013421
2          Garreett  1578013421
3           Garrett  1578013421
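This frame is only printed, not constructed, in the original answer; reconstructed, it would be roughly:

import pandas as pd

df = pd.DataFrame({'Name': ['Air Ordnance', 'Air-Ordnance.com', 'Garreett', 'Garrett'],
                   'ID': [1578013421] * 4})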

First get fuzz.ratio as in the solution above:

from fuzzywuzzy import fuzz

df1 = df.merge(df, on='ID').query('Name_x != Name_y')
df1['ratio'] = df1.apply(lambda x:  fuzz.ratio(x['Name_x'], x['Name_y']), axis=1)
print (df1)
              Name_x          ID            Name_y  ratio
1       Air Ordnance  1578013421  Air-Ordnance.com     79
2       Air Ordnance  1578013421          Garreett     30
3       Air Ordnance  1578013421           Garrett     32
4   Air-Ordnance.com  1578013421      Air Ordnance     79
6   Air-Ordnance.com  1578013421          Garreett     25
7   Air-Ordnance.com  1578013421           Garrett     26
8           Garreett  1578013421      Air Ordnance     30
9           Garreett  1578013421  Air-Ordnance.com     25
11          Garreett  1578013421           Garrett     93
12           Garrett  1578013421      Air Ordnance     32
13           Garrett  1578013421  Air-Ordnance.com     26
14           Garrett  1578013421          Garreett     93

Then filter by threshold:

N = 50    
df2 = df1[df1['ratio'].gt(N)]
print (df2)

              Name_x          ID            Name_y  ratio
1       Air Ordnance  1578013421  Air-Ordnance.com     79
4   Air-Ordnance.com  1578013421      Air Ordnance     79
11          Garreett  1578013421           Garrett     93
14           Garrett  1578013421          Garreett     93

But for more precision it is necessary to specify which strings are valid in a list L, and filter by that list:

L = ['Air-Ordnance.com','Garrett']
df2 = df2.loc[df2['Name_x'].isin(L),['Name_x','Name_y','ID']].rename(columns={'Name_y':'Name'})
print (df2)
              Name_x          Name          ID
4   Air-Ordnance.com  Air Ordnance  1578013421
14           Garrett      Garreett  1578013421

Finally, merge back to the original with a left join and replace the missing values:

df = df.merge(df2, on=['Name','ID'], how='left')
df['Name'] = df.pop('Name_x').fillna(df['Name'])
print (df)
               Name          ID
0  Air-Ordnance.com  1578013421
1  Air-Ordnance.com  1578013421
2           Garrett  1578013421
3           Garrett  1578013421
jezrael
  • Here the problem with my data is that the same ID is shared by more than one name, e.g. Air Ordnance 1578013421, Air-Ordnance.com 1578013421, Garreett 1578013421, Garrett 1578013421. How do I address that? I want the first ones to be Air Ordnance and the others to be Garrett. – asspsss Feb 17 '20 at 02:01
  • But the dataset contains more than 16 million such records, so we cannot manually specify the list. @jezrael – asspsss Feb 17 '20 at 16:26