0

I have two data frames:

import pandas as pd

first_df = pd.DataFrame({'Full Name': ['Mulligan Nick & Mary', 'Tsang S C', 'Hattie J A C '],
                         'Address': ['270 Claude Road', '13 Sunnyridge Place', '18A Empire Road']})

second_df = pd.DataFrame({'Owner' : ['David James Mulligan', 'Brenda Joy Mulligan ', 'Helen Kwok Hattie'],
                          'Add Match': ['19 Dexter Avenue', 'Claude Road ', 'Building NO 512']})

Is there anyway to match only the first string in Full Name column to the last string in Owner column.

If there is a match, I then want to compare Address against Add match to see if there are any like values. If the first condition passes but the second condition fails, this would not be added into the new data frame.

Using a left join results in:

new_df = first_df.merge(second_df, how='left', left_on = ['Full Name', 'Address'], right_on = ['Owner', 'Add Match'])
print(new_df.head())

              Full Name              Address Owner Add Match
0  Mulligan Nick & Mary      270 Claude Road   NaN       NaN
1             Tsang S C  13 Sunnyridge Place   NaN       NaN
2         Hattie J A C       18A Empire Road   NaN       NaN

However the output wanted would look more like this:

new_df

Name                 Address
----                 --------
Brenda Joy Mulligan  Claude Road

Lui Hellesoe
  • 185
  • 1
  • 10
  • Could you add to your question what you have tried so far? Why a simple inner join such as `first_df.merge(second_df, how='inner', left_on = ['Full Name', 'Address'], right_on = ['Owner', 'Add Match'])` wouldn't work? – xicocaio Mar 30 '21 at 11:23
  • Using an inner join, `new_df = first_df.merge(second_df, how='inner', left_on = ['Full Name', 'Address'], right_on = ['Owner', 'Add Match'])`gives me an empty data frame with 20 columns. – Lui Hellesoe Mar 30 '21 at 11:41
  • 1
    If an `inner` join is returning empty results it is because it can't match both keys on the initial dataframes. Change the `how` key to `left` or `right` to see if it is closer to what you are looking for. However, from your examples it is not quite clear what is your problem, try following this guide on [how to write good pandas questions](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). Also, do not forget to add to your question what is the solution that you have tried so far. – xicocaio Mar 30 '21 at 11:47

2 Answers2

2

You could take advantage of the difflib module from Python standard library to find similarities between different columns. For instance, you can define the following function:

from difflib import SequenceMatcher

def compare_df(left, right, col: str):
    left[f"{col}_match_ratio"] = 0

    for value in left[col]:
        best_ratio = 0
        for other in right[col]:
            result = SequenceMatcher(None, str(value), str(other)).ratio()
            if result > best_ratio:
                best_ratio = result
        left.loc[left[col] == value, f"{col}_match_ratio"] = round(best_ratio, 2)

Then:

  • you just have to make sure that the column you want to compare on have the same name in both dfs
  • you call df_compare(first_df, second_df, "Owner") which will add "Owner_match_ratio" column to second_df
  • finally, you filter second df on the desired minimum match ratio (70 % for instance) like this: new_df = second_df.loc[second_df["Owner_match_ratio"] > 0.7, :]
Laurent
  • 12,287
  • 7
  • 21
  • 37
1

Inspired by this answer you could employ a similar solution.

TL;DR

first_df[['last_name', 'start_name']] = first_df['Full Name'].str.split(' ', 1, expand=True)
second_df['last_name'] = second_df['Owner'].str.split(' ').str[-1]
df_final = first_df.merge(second_df, how='inner', left_on=['last_name'], right_on=['last_name'])
address_matches = df_final.apply(lambda x: True if difflib.get_close_matches(x['Address'], [x['Add Match']], n=1, cutoff=0.8) else False, axis=1)
df_final = df_final[address_matches].drop(columns=['last_name', 'start_name', 'Full Name', 'Address']).rename(columns={'Owner':'Name', 'Add Match': 'Address'})

Step-by-step

Initially, you extract the last name keys you want.

first_df[['last_name', 'start_name']] = first_df['Full Name'].str.split(' ', 1, expand=True)
second_df['last_name'] = second_df['Owner'].str.split(' ').str[-1]

PS: Here we are using built-in string methods from pandas/numpy combo given your instructions. But if it fits you better you could also apply the similarity methods (e.g., difflib.get_close_matches) shown down below for the address part.

Next, you perform an inner join of these dataframes to match the last_name key.

df_temp = first_df.merge(second_df, how='inner', left_on=['last_name'], right_on=['last_name'])

Then you apply the difflib.get_close_matches with the desired similarity (I used cutoff=0.8 because above this value there were no values returned) method to mark which rows contain matches and subsequently get only the rows you want.

matches_mask = df_final.apply(lambda x: True if difflib.get_close_matches(x['Address'], [x['Add Match']], n=1, cutoff=0.8) else False, axis=1)
df_final = df_final[matches_mask].drop(columns=['last_name', 'start_name'])
Full Name               Address             Owner                   Add Match

Mulligan Nick & Mary    270 Claude Road     Brenda Joy Mulligan     Claude Road

Finally, to get match the format of the results posted at the end of your question you drop or rename some columns.

df_final.drop(columns=['Full Name', 'Address']).rename(columns={'Owner':'Name', 'Add Match': 'Address'})
Owner                   Add Match

Brenda Joy Mulligan     Claude Road
xicocaio
  • 867
  • 1
  • 10
  • 27