I am attempting to merge two data frames on the column "Name". The names are relatively the same. The merge is not working because there are discrepancies between the names due to special characters, missing suffixes, and nicknames. Any ideas?
df1
Name | Performance |
---|---|
Matt Cole Jr | 89 |
Steve Smith | 72 |
DJ Watson | 91 |
Amar'e George | 92 |
Steve Nurkić | 84 |
Charles Long | 87 |
Nic Kitch | 73 |
df2
Name | Year |
---|---|
Matt Cole Jr. | 2023 |
Steve Smith III | 2022 |
D.J. Watson | 2021 |
Amare George | 2020 |
Steve Nurkic | 2021 |
Charles Long Jr. | 2022 |
Nicolas Kitch | 2023 |
Desired result:
Name | Year | Performance |
---|---|---|
Matt Cole Jr. | 2023 | 89 |
Steve Smith III | 2022 | 72 |
D.J. Watson | 2021 | 91 |
Amare George | 2020 | 92 |
Steve Nurkic | 2021 | 84 |
Charles Long Jr. | 2022 | 87 |
Nicolas Kitch | 2023 | 73 |
I was thinking this could work but the code was taking way too long to execute:
from fuzzywuzzy import process
dfm = pd.DataFrame(df2["Name"].apply(lambda x: process.extractOne(x, df1["Name"])) .tolist()])
I also tried this but it didn't match the names correctly:
import difflib
from functools import partial
#the below function is inspired from https://stackoverflow.com/a/56521804/9840637
def get_closest_match(x,y):
"""x=possibilities , y = input"""
f = partial(
difflib.get_close_matches, possibilities=x.unique(), n=1,cutoff=0.5)
matches = y.astype(str).drop_duplicates().map(f).fillna('').str[0]
return pd.DataFrame([y,matches.rename('Name')]).T
temp = get_closest_match(df2['Name'],df1['Name'])
display(temp)
Any ideas? Thanks.