Pandas - Merge two DataFrame with partial match

Question

Given the two data frames defined below, I would like to merge on ['A','B','C'] and ['X','Y','Z'] first, then gradually look for a match with one less column, i.e. ['A','B'] and ['X','Y'], then ['A'] and ['X'], without duplicating rows in the result. In the example below, a,y,y,v3 is left out since a,d,d already matched.

My code so far, matches on all 3 columns:

import pandas as pd

df1 = pd.DataFrame({"A":['a','b','c'],"B":['d','e','f'],"C":['d','e','f']})
df2 = pd.DataFrame({"X":['a','b','a','c'],"Y":['d','e','y','z'],"Z":['d','x','y','z'],"V":['v1','v2','v3','v4']})

merged = pd.merge(df1,df2,left_on=['A','B','C'],right_on=['X','Y','Z'], how='left')
merged = merged.drop_duplicates(['A','B','C'])
merged.head()
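
For reference, the rows that still need a fallback match can be listed by running the same full-key merge with `indicator=True`, which adds a `_merge` column. This is only a diagnostic sketch on the sample frames; the variable name `unmatched` is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ['a', 'b', 'c'], "B": ['d', 'e', 'f'], "C": ['d', 'e', 'f']})
df2 = pd.DataFrame({"X": ['a', 'b', 'a', 'c'], "Y": ['d', 'e', 'y', 'z'],
                    "Z": ['d', 'x', 'y', 'z'], "V": ['v1', 'v2', 'v3', 'v4']})

# indicator=True adds a "_merge" column: "both" for matched rows, "left_only" otherwise
merged = pd.merge(df1, df2, left_on=['A', 'B', 'C'], right_on=['X', 'Y', 'Z'],
                  how='left', indicator=True)

# Rows of df1 that found no match on all three columns
unmatched = merged.loc[merged['_merge'] == 'left_only', ['A', 'B', 'C']]
print(unmatched)
```

On the sample data only the `a,d,d` row matches on all three columns, so the rows starting with `b` and `c` are the ones a partial match has to cover.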

How can I achieve my goal?

Update: expected output

Is `a,y,y,v3` left out because you already have a row that matches on all 3 columns? Could you also add the expected output? — Dani Mesejo, Dec 01 '20 at 12:45

jezrael · Accepted Answer · 2020-12-01T14:35:40.307

One idea is to run several merges in a loop, calling DataFrame.drop_duplicates on the second DataFrame at each step, which should avoid duplicated rows in the final DataFrame:

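import pandas as pd
from functools import reduce

dfs = []
L = [['A', 'B', 'C'], ['X', 'Y', 'Z']]

# Merge on 3, then 2, then 1 key column(s)
for i in range(len(L[0]), 0, -1):
    # Keep one row of df2 per key combination so the left merge adds no rows
    df22 = df2.drop_duplicates(L[1][:i])
    df = pd.merge(df1, df22, left_on=L[0][:i], right_on=L[1][:i], how='left')
    dfs.append(df)

# Fill rows left unmatched by a stricter merge with values from the next, looser one
df = reduce(lambda l, r: l.fillna(r), dfs)
print(df)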
   A  B  C  X  Y  Z   V
0  a  d  d  a  d  d  v1
1  b  e  e  b  e  x  v2
2  c  f  f  c  z  z  v4

This works like:

merged1 = pd.merge(df1,df2.drop_duplicates(['X','Y','Z']),left_on=['A','B','C'],right_on=['X','Y','Z'], how='left')
merged2 = pd.merge(df1,df2.drop_duplicates(['X','Y']),left_on=['A','B'],right_on=['X','Y'], how='left')
merged3 = pd.merge(df1,df2.drop_duplicates('X'),left_on=['A'],right_on=['X'], how='left')

df = merged1.fillna(merged2).fillna(merged3)
print (df)
   A  B  C  X  Y  Z   V
0  a  d  d  a  d  d  v1
1  b  e  e  b  e  x  v2
2  c  f  f  c  z  z  v4
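
The same idea can be wrapped in a small helper so the key lists are not hard-coded. This is a minimal sketch, assuming both key lists have the same length and are ordered from most to least important; the function name `cascade_merge` is hypothetical and not part of pandas.

```python
import pandas as pd

def cascade_merge(left, right, left_keys, right_keys):
    """Left-merge on all keys, then fill unmatched rows using shorter key prefixes."""
    result = None
    for i in range(len(left_keys), 0, -1):
        # Keep one right-hand row per key prefix so the left merge cannot add rows
        right_unique = right.drop_duplicates(right_keys[:i])
        step = left.merge(right_unique, left_on=left_keys[:i],
                          right_on=right_keys[:i], how='left')
        # Stricter matches win; looser steps only fill cells that are still NaN
        result = step if result is None else result.fillna(step)
    return result

df1 = pd.DataFrame({"A": ['a', 'b', 'c'], "B": ['d', 'e', 'f'], "C": ['d', 'e', 'f']})
df2 = pd.DataFrame({"X": ['a', 'b', 'a', 'c'], "Y": ['d', 'e', 'y', 'z'],
                    "Z": ['d', 'x', 'y', 'z'], "V": ['v1', 'v2', 'v3', 'v4']})

out = cascade_merge(df1, df2, ['A', 'B', 'C'], ['X', 'Y', 'Z'])
print(out)
```

Because each step merges against a deduplicated right-hand frame, every intermediate result has exactly one row per row of `left`, which is what lets `fillna` align the steps by position.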

can you please have a look here: https://stackoverflow.com/questions/65561016/pandas-expanding-average-session-time ? — Shlomi Schwartz, Jan 04 '21 at 10:40

tgrandje · Answer 2 · 2020-12-01T14:47:48.257

What about this:

import numpy as np
import pandas as pd

matches = [['A', 'B', 'C'], ['X', 'Y', 'Z']]
df = df1.copy()
for k in range(len(matches[0])):

    #Get the left/right keys for this iteration (drop the last k columns):
    left, right = matches
    left = left if k==0 else left[:-k]
    right = right if k==0 else right[:-k]

    #Make sure columns from df2 exist in df (object dtype, so strings can be filled in later)
    for col in df2.columns.tolist():
        if col not in df.columns:
            df[col] = pd.Series(np.nan, index=df.index, dtype=object)

    #Merge dataframes
    df = df.merge(df2, left_on=left, right_on=right, how='left')

    #Find which row of df's "left" columns (previously initialised) are empty
    ix_left_part = np.all([df[x + "_x"].isnull() for x in right], axis=0)

    #Find which row of df's "right" columns are not empty
    ix_right_part = np.all([df[x + "_y"].notnull() for x in right], axis=0)

    #Combine both to get indexes
    ix = df[ix_left_part & ix_right_part].index

    #Complete values on "left" with those from "right"
    for x in df2.columns.tolist():
        df.loc[ix, x+"_x"] = df.loc[ix, x+'_y']

    #Drop values from "right"
    df.drop([x+"_y" for x  in df2.columns.tolist()], axis=1, inplace=True)

    #Rename "left" columns to stick with original names from df2
    df.rename({x+"_x":x for x  in df2.columns.tolist()}, axis=1, inplace=True)

#Drop any remaining duplicates
df.drop_duplicates(keep="first", inplace=True)
print(df)

EDIT

I corrected the loop; this should be easier on memory:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({"A":['a','b','c'],"B":['d','e','f'],"C":['d','e','f']})
df2 = pd.DataFrame({"X":['a','b','a','c'],"Y":['d','e','y','z'],"Z":['d','x','y','z'],"V":['v1','v2','v3','v4']})

matches = [['A', 'B', 'C'], ['X', 'Y', 'Z']]
df = df1.copy()

#Make sure columns of df2 exist in df (object dtype, so strings can be filled in later)
for col in df2.columns.tolist():
    df[col] = pd.Series(np.nan, index=df.index, dtype=object)

for k in range(len(matches[0])):

    #Get your left/right keys right at each iteration :
    left, right = matches
    left = left if k==0 else left[:-k]
    right = right if k==0 else right[:-k]
    
    #Rebuild a dataframe of the (potentially) usable data in df2:
    ix = df[df.V.isnull()].index
    temp = (
            df.loc[ix, left]
            .rename(dict(zip(left, right)), axis=1)
            )
    
    temp=temp.merge(df2, on=right, how="inner")
    
    #Merge dataframes
    df = df.merge(temp, left_on=left, right_on=right, how='left')
    
    
    #Combine both to get indexes
    ix = df[(df['V_x'].isnull()) & (df['V_y'].notnull())].index
    

    #Complete values on "left" with those from "right"
    cols_left = [x+'_x' for x in df2.columns.tolist()]
    cols_right = [x+'_y' for x in df2.columns.tolist()]    
    df.loc[ix, cols_left] = df.loc[ix, cols_right].values.tolist()
        
    #Drop values from "right"
    df.drop(cols_right, axis=1, inplace=True)
    
    #Rename "left" columns to stick with original names from df2
    rename = {x+"_x":x for x  in df2.columns.tolist()}
    df.rename(rename, axis=1, inplace=True)

print(df)
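
A possible variant of the same loop, which may be lighter on memory, merges only the rows that are still unmatched and deduplicates df2 on the current key prefix first, so no merge can multiply rows. This is a sketch under assumptions rather than a tested replacement: it assumes df1 has at least one row, that the first df2 row for a given key prefix is an acceptable match, and the names `pending`, `candidates` and `parts` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ['a', 'b', 'c'], "B": ['d', 'e', 'f'], "C": ['d', 'e', 'f']})
df2 = pd.DataFrame({"X": ['a', 'b', 'a', 'c'], "Y": ['d', 'e', 'y', 'z'],
                    "Z": ['d', 'x', 'y', 'z'], "V": ['v1', 'v2', 'v3', 'v4']})

left_keys, right_keys = ['A', 'B', 'C'], ['X', 'Y', 'Z']

parts = []             # matched rows collected at each step
pending = df1.copy()   # rows of df1 still waiting for a match

for i in range(len(left_keys), 0, -1):
    if pending.empty:
        break
    # One candidate per key prefix, so the left merge never multiplies rows
    candidates = df2.drop_duplicates(right_keys[:i])
    step = pending.merge(candidates, left_on=left_keys[:i],
                         right_on=right_keys[:i], how='left', indicator=True)
    # merge() returns a fresh RangeIndex; restore the original row labels
    step.index = pending.index
    found = (step['_merge'] == 'both').to_numpy()
    if found.any():
        parts.append(step.loc[found].drop(columns='_merge'))
    pending = pending.loc[~found]

# Rows that never matched are kept, with NaN in the df2 columns
if not pending.empty:
    parts.append(pending)
result = pd.concat(parts).sort_index()
print(result)
```

Each pass works on a shrinking `pending` frame, so the largest intermediate result is never bigger than df1 itself.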

I think I messed up something around the "ix_left_part", you shouldn't get duplicates at the end... I think I know what it is, but correcting it will depend on whether there is one "target" column (meaning 'V' on your real df2). — tgrandje, Dec 01 '20 at 13:02
there is one target, the V column. I'll test the code and let you know — Shlomi Schwartz, Dec 01 '20 at 13:06
Your code works on the dummy data I provided, kudos for that... however, it consumes a lot of memory and crashes my environment when I run it on the actual data :( — Shlomi Schwartz, Dec 01 '20 at 13:30
df1 is (10000, 3) and df2 is (137503, 3) , there are only 2 columns to match in my actual data, but I was looking for a generic answer, like the one you provided. — Shlomi Schwartz, Dec 01 '20 at 13:33
Uh, so say again : how many "links" columns and how many "target" columns do you have ? — tgrandje, Dec 01 '20 at 13:35
in my actual data: 2 link columns, one target column in df2 and one in df1 — Shlomi Schwartz, Dec 01 '20 at 13:36
