Fuzzy matching and iteration through DataFrame

Question

I have these two DataFrames: I want to fuzzy match the Surname strings to the corresponding Names

dico = {'Name': ['Arthur','Henri','Lisiane','Patrice'],
        "Age": ["20","18","62","73"],
        "Studies": ['Economics','Maths','Psychology','Medical']
             }
dico2 = {'Surname': ['Henri2','Arthur1','Patrice4','Lisiane3']}

dico = pd.DataFrame.from_dict(dico)
dico2 = pd.DataFrame.from_dict(dico2)

I want to fuzzy match the Surname strings to the corresponding Names to have an output as follows

      Name   Surname Age     Studies
0   Arthur   Arthur1  20   Economics
1    Henri    Henri2  18       Maths
2  Lisiane  Lisiane3  62  Psychology
3  Patrice  Patrice4  73     Medical

and here is my code so far:

dico['Surname'] = []
for i in dico2:
    lst = [0, 0, 0]
    for j in dico:
        if lst[0] < fuzz.ratio(i,j):
            lst[0] = fuzz.ratio(i,j)
            lst[1] = i
            lst[2] = j
    dico['Surname'].append(i)

but i get a ValueError: Length of values (0) does not match length of index (4), which I don't get why. Thanks !

A bit unrelated, but you should not calculate the fuzz.ratio twice, since it is slow. Calculate it once in front of the if statement and store it in a variable ;) — maxbachmann, Mar 22 '21 at 22:10

JuliettVictor · Answer 1 · 2021-03-23T03:58:44.517

The error originates from

dico['Surname'] = []

dico['Surname'] is length 4, while [] is length 0. You can instead collect your surnames in a list and then add the surnames to the dataframe in one go after the loop.

You also need to tell the outer loop to iterate over dico2['Surname'] instead of the entire dataframe.

surnames = []
for i in dico2['Surname']:
    lst = [0, 0, 0]
    for j in dico:
        if lst[0] < fuzz.ratio(i,j):
            lst[0] = fuzz.ratio(i,j)
            lst[1] = i
            lst[2] = j
    surnames.append(i)
    
dico['Surname'] = surnames

EDIT: only fixed the error in question. Also see maxbachmann's advise on not calling fuzz.ratio twice.

score 0 · Answer 2 · answered Mar 22 '21 at 20:56

dico = {'Name': ['Arthur','Henri','Lisiane','Patrice'],
        "Age": ["20","18","62","73"],
        "Studies": ['Economics','Maths','Psychology','Medical']
             }
dico2 = {'Surname': ['Henri2','Arthur1','Patrice4','Lisiane3']}

dico = pd.DataFrame.from_dict(dico)
dico2 = pd.DataFrame.from_dict(dico2)

temp = pd.DataFrame()

for x in range (0, len(dico.Name)):
    name_str = dico.Name[x]
    temp = pd.concat([temp, dico2[dico2.Surname.str.contains(name_str)].Surname])

temp.columns=['Surname']

temp = temp.reset_index(drop = True)

dico = pd.concat([dico, temp], axis=1)

score 0 · Answer 3 · answered Mar 22 '21 at 22:37

Solution

map_list = []
for name in dico['Name']:
    best_ratio = None
    for idx, surname in enumerate(dico2['Surname']):
        if best_ratio == None:
            best_ratio = fuzz.ratio(name, surname)
            best_idx = 0
        else:
            ratio = fuzz.ratio(name, surname)
            if  ratio > best_ratio:
                best_ratio = ratio
                best_idx = idx
    map_list.append(dico2['Surname'][best_idx]) # obtain surname

dico['Surname'] = pd.Series(map_list) # add column
dico = dico[["Name", "Surname", "Age", "Studies"]] # reorder columns

Fuzzy matching and iteration through DataFrame

3 Answers3

Solution

Result