Merging data frames on column with string partial match

Question

I am attempting to merge two data frames on the column "Name". The names are relatively the same. The merge is not working because there are discrepancies between the names due to special characters, missing suffixes, and nicknames. Any ideas?

df1

Name	Performance
Matt Cole Jr	89
Steve Smith	72
DJ Watson	91
Amar'e George	92
Steve Nurkić	84
Charles Long	87
Nic Kitch	73

df2

Name	Year
Matt Cole Jr.	2023
Steve Smith III	2022
D.J. Watson	2021
Amare George	2020
Steve Nurkic	2021
Charles Long Jr.	2022
Nicolas Kitch	2023

Desired result:

Name	Year	Performance
Matt Cole Jr.	2023	89
Steve Smith III	2022	72
D.J. Watson	2021	91
Amare George	2020	92
Steve Nurkic	2021	84
Charles Long Jr.	2022	87
Nicolas Kitch	2023	73

I was thinking this could work but the code was taking way too long to execute:

from fuzzywuzzy import process

dfm = pd.DataFrame(df2["Name"].apply(lambda x: process.extractOne(x, df1["Name"])) .tolist()])

I also tried this but it didn't match the names correctly:

import difflib
from functools import partial
#the below function is inspired from https://stackoverflow.com/a/56521804/9840637
def get_closest_match(x,y):
    """x=possibilities , y = input"""
    f = partial(
    difflib.get_close_matches, possibilities=x.unique(), n=1,cutoff=0.5)

    matches = y.astype(str).drop_duplicates().map(f).fillna('').str[0]
    return pd.DataFrame([y,matches.rename('Name')]).T

temp = get_closest_match(df2['Name'],df1['Name'])
display(temp)

Any ideas? Thanks.

are all names in `df1` also present in `df2` and vice versa? — Stuart, Jun 08 '23 at 16:29

score 0 · Answer 1 · answered Jun 08 '23 at 16:28

rapidfuzz has .cdist() which will score everything in parallel:

import rapidfuzz

scores = rapidfuzz.process.cdist(df1['Name'], df2['Name'], workers=-1)
max_score = df2['Name'][scores.argmax(axis=1)]

df1.merge(df2, left_on=max_score, right_on='Name', how='left')

               Name         Name_x  Performance            Name_y  Year
0     Matt Cole Jr.   Matt Cole Jr           89     Matt Cole Jr.  2023
1   Steve Smith III    Steve Smith           72   Steve Smith III  2022
2       D.J. Watson      DJ Watson           91       D.J. Watson  2021
3      Amare George  Amar'e George           92      Amare George  2020
4      Steve Nurkic   Steve Nurkić           84      Steve Nurkic  2021
5  Charles Long Jr.   Charles Long           87  Charles Long Jr.  2022
6     Nicolas Kitch      Nic Kitch           73     Nicolas Kitch  2023

Merging data frames on column with string partial match

1 Answers1