How to get Pandas df.merge() mismatch column name

Question

Given the following data:

data_df = pd.DataFrame({
    "Reference": ("A", "A", "A", "B", "C", "C", "D", "E"),
    "Value1": ("U", "U", "U--","V", "W", "W--", "X", "Y"),
    "Value2": ("u", "u--", "u","v", "w", "w", "x", "y")
    }, index=[1, 2, 3, 4, 5, 6, 7, 8])
truth_df = pd.DataFrame({
    "Reference": ("A", "B", "C", "D", "E"),
    "Value1": ("U", "V", "W", "X", "Y"),
    "Value2": ("u", "v", "w", "x", "y")
    }, index=[1, 4, 5, 7, 8])

data_df

	Reference	Value1	Value2
1	A	U	u
2	A	U	u--
3	A	U--	u
4	B	V	v
5	C	W	w
6	C	W--	w
7	D	X	x
8	E	Y	y

truth_df

	Reference	Value1	Value2
1	A	U	u
4	B	V	v
5	C	W	w
7	D	X	x
8	E	Y	y

The code below merges both and flags the rows where the values don't match from truth_df

df_out = data_df.merge(truth_df, on=['Reference', 'Value1','Value2'], how='left', indicator=True).rename(columns={"_merge":"Issues"})
df_out["Issues"] = np.where(df_out["Issues"] == "left_only", "Flag", "")
df_out

Yields the following dataframe

df_out

	Reference	Value1	Value2	Issues
0	A	U	u
1	A	U	u--	Flag
2	A	U--	u	Flag
3	B	V	v
4	C	W	w
5	C	W--	w	Flag
6	D	X	x
7	E	Y	y

I would like to know if there is a way to identify which columns are in conflict, so instead of displaying Flag I'll display the column name like Value1 and Value2 respectively.

desired df

	Reference	Value1	Value2	Issues
0	A	U	u
1	A	U	u--	Value2
2	A	U--	u	Value1
3	B	V	v
4	C	W	w
5	C	W--	w	Value1
6	D	X	x
7	E	Y	y

Please note that even though the Reference column looks useful, it is not a reliable source in my real data, so any solutions must not use that column.

By comparing against the `truth_df` – Ricardo Sanchez May 09 '21 at 12:17 — Ricardo Sanchez, May 09 '21 at 12:17

score 4 · Answer 1 · answered May 09 '21 at 12:21

You can set index of both the dataframes as index using df.set_index then check for equality using df.ne then use df.dot with columns.

data_df = data_df.set_index('Reference') # Ignore if Reference is index already
truth_df = truth_df.set_index('Reference') # Ignore if Reference is index already
data_df['issue'] = data_df.ne(truth_df, axis=1).dot(data_df.columns)
print(data_df.reset_index())

  Reference Value1 Value2   issue
0         A      U      u        
1         A      U    u--  Value2
2         A    U--      u  Value1
3         B      V      v        
4         C      W      w        
5         C    W--      w  Value1
6         D      X      x        
7         E      Y      y

Yes like that but, is there another way without using the `Reference` column? I should have mention that in my post (I will change it) `Reference` is not a reliable column to use in with my real data — Ricardo Sanchez, May 09 '21 at 18:30

score 0 · Answer 2 · answered May 09 '21 at 12:18

import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

data_df = pd.DataFrame({
    "Reference": ("A", "A", "A", "B", "C", "C", "D", "E"),
    "Value1": ("U", "U", "U--","V", "W", "W--", "X", "Y"),
    "Value2": ("u", "u--", "u","v", "w", "w", "x", "y")
    }, index=[1, 2, 3, 4, 5, 6, 7, 8])
truth_df = pd.DataFrame({
    "Reference": ("A", "B", "C", "D", "E"),
    "Value1": ("U", "V", "W", "X", "Y"),
    "Value2": ("u", "v", "w", "x", "y")
    }, index=[1, 4, 5, 7, 8])

for col in data_df.columns:
    flaged = col

df_out = data_df.merge(truth_df, on=['Reference', 'Value1','Value2'], how='left', indicator=True).rename(columns={"_merge":"Issues"})
df_out["Issues"] = np.where(df_out["Issues"] == "left_only", flaged, "")
print(df_out)

Try this.

Thank you but it only Flags `Value2` as the columns and that is not the case on all of them — Ricardo Sanchez, May 09 '21 at 12:20

score 0 · Answer 3 · answered May 09 '21 at 15:16

Another approach (using the df_out you already have):

def fn(x):
    truth_idx = truth_df["Reference"].eq(x["Reference"]).idxmax()
    if x["Value1"] == truth_df.at[truth_idx, "Value1"]:
        return "Value2"
    return "Value1"


m = df_out["Issues"] == "Flag"
df_out.loc[m, "Issues"] = df_out.loc[m, ["Reference", "Value1"]].apply(
    fn,
    axis=1,
)
print(df_out)

Prints:

  Reference Value1 Value2  Issues
0         A      U      u        
1         A      U    u--  Value2
2         A    U--      u  Value1
3         B      V      v        
4         C      W      w        
5         C    W--      w  Value1
6         D      X      x        
7         E      Y      y

Sorry I forgot to mention that the `Reference` column is not a reliable column in my real data so I cannot use it. — Ricardo Sanchez, May 09 '21 at 18:35

How to get Pandas df.merge() mismatch column name

3 Answers3

Linked