1

I am attempting to merge two data frames on the column "Name". The names are relatively the same. The merge is not working because there are discrepancies between the names due to special characters, missing suffixes, and nicknames. Any ideas?

df1

Name Performance
Matt Cole Jr 89
Steve Smith 72
DJ Watson 91
Amar'e George 92
Steve Nurkić 84
Charles Long 87
Nic Kitch 73

df2

Name Year
Matt Cole Jr. 2023
Steve Smith III 2022
D.J. Watson 2021
Amare George 2020
Steve Nurkic 2021
Charles Long Jr. 2022
Nicolas Kitch 2023

Desired result:

Name Year Performance
Matt Cole Jr. 2023 89
Steve Smith III 2022 72
D.J. Watson 2021 91
Amare George 2020 92
Steve Nurkic 2021 84
Charles Long Jr. 2022 87
Nicolas Kitch 2023 73

I was thinking this could work but the code was taking way too long to execute:

from fuzzywuzzy import process

dfm = pd.DataFrame(df2["Name"].apply(lambda x: process.extractOne(x, df1["Name"])) .tolist()])

I also tried this but it didn't match the names correctly:

import difflib
from functools import partial
#the below function is inspired from https://stackoverflow.com/a/56521804/9840637
def get_closest_match(x,y):
    """x=possibilities , y = input"""
    f = partial(
    difflib.get_close_matches, possibilities=x.unique(), n=1,cutoff=0.5)

    matches = y.astype(str).drop_duplicates().map(f).fillna('').str[0]
    return pd.DataFrame([y,matches.rename('Name')]).T

temp = get_closest_match(df2['Name'],df1['Name'])
display(temp)

Any ideas? Thanks.

1 Answers1

0

has .cdist() which will score everything in parallel:

import rapidfuzz

scores = rapidfuzz.process.cdist(df1['Name'], df2['Name'], workers=-1)
max_score = df2['Name'][scores.argmax(axis=1)]

df1.merge(df2, left_on=max_score, right_on='Name', how='left')
               Name         Name_x  Performance            Name_y  Year
0     Matt Cole Jr.   Matt Cole Jr           89     Matt Cole Jr.  2023
1   Steve Smith III    Steve Smith           72   Steve Smith III  2022
2       D.J. Watson      DJ Watson           91       D.J. Watson  2021
3      Amare George  Amar'e George           92      Amare George  2020
4      Steve Nurkic   Steve Nurkić           84      Steve Nurkic  2021
5  Charles Long Jr.   Charles Long           87  Charles Long Jr.  2022
6     Nicolas Kitch      Nic Kitch           73     Nicolas Kitch  2023
jqurious
  • 9,953
  • 1
  • 4
  • 14