4

Given the following data:

data_df = pd.DataFrame({
    "Reference": ("A", "A", "A", "B", "C", "C", "D", "E"),
    "Value1": ("U", "U", "U--","V", "W", "W--", "X", "Y"),
    "Value2": ("u", "u--", "u","v", "w", "w", "x", "y")
    }, index=[1, 2, 3, 4, 5, 6, 7, 8])
truth_df = pd.DataFrame({
    "Reference": ("A", "B", "C", "D", "E"),
    "Value1": ("U", "V", "W", "X", "Y"),
    "Value2": ("u", "v", "w", "x", "y")
    }, index=[1, 4, 5, 7, 8])

data_df

Reference Value1 Value2
1 A U u
2 A U u--
3 A U-- u
4 B V v
5 C W w
6 C W-- w
7 D X x
8 E Y y

truth_df

Reference Value1 Value2
1 A U u
4 B V v
5 C W w
7 D X x
8 E Y y

The code below merges both and flags the rows where the values don't match from truth_df

df_out = data_df.merge(truth_df, on=['Reference', 'Value1','Value2'], how='left', indicator=True).rename(columns={"_merge":"Issues"})
df_out["Issues"] = np.where(df_out["Issues"] == "left_only", "Flag", "")
df_out

Yields the following dataframe

df_out

Reference Value1 Value2 Issues
0 A U u
1 A U u-- Flag
2 A U-- u Flag
3 B V v
4 C W w
5 C W-- w Flag
6 D X x
7 E Y y

I would like to know if there is a way to identify which columns are in conflict, so instead of displaying Flag I'll display the column name like Value1 and Value2 respectively.

desired df

Reference Value1 Value2 Issues
0 A U u
1 A U u-- Value2
2 A U-- u Value1
3 B V v
4 C W w
5 C W-- w Value1
6 D X x
7 E Y y

Please note that even though the Reference column looks useful, it is not a reliable source in my real data, so any solutions must not use that column.

Ricardo Sanchez
  • 4,935
  • 11
  • 56
  • 86

3 Answers3

4

You can set index of both the dataframes as index using df.set_index then check for equality using df.ne then use df.dot with columns.

data_df = data_df.set_index('Reference') # Ignore if Reference is index already
truth_df = truth_df.set_index('Reference') # Ignore if Reference is index already
data_df['issue'] = data_df.ne(truth_df, axis=1).dot(data_df.columns)
print(data_df.reset_index())

  Reference Value1 Value2   issue
0         A      U      u        
1         A      U    u--  Value2
2         A    U--      u  Value1
3         B      V      v        
4         C      W      w        
5         C    W--      w  Value1
6         D      X      x        
7         E      Y      y        
Ch3steR
  • 20,090
  • 4
  • 28
  • 58
  • Yes like that but, is there another way without using the `Reference` column? I should have mention that in my post (I will change it) `Reference` is not a reliable column to use in with my real data – Ricardo Sanchez May 09 '21 at 18:30
0
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

data_df = pd.DataFrame({
    "Reference": ("A", "A", "A", "B", "C", "C", "D", "E"),
    "Value1": ("U", "U", "U--","V", "W", "W--", "X", "Y"),
    "Value2": ("u", "u--", "u","v", "w", "w", "x", "y")
    }, index=[1, 2, 3, 4, 5, 6, 7, 8])
truth_df = pd.DataFrame({
    "Reference": ("A", "B", "C", "D", "E"),
    "Value1": ("U", "V", "W", "X", "Y"),
    "Value2": ("u", "v", "w", "x", "y")
    }, index=[1, 4, 5, 7, 8])

for col in data_df.columns:
    flaged = col

df_out = data_df.merge(truth_df, on=['Reference', 'Value1','Value2'], how='left', indicator=True).rename(columns={"_merge":"Issues"})
df_out["Issues"] = np.where(df_out["Issues"] == "left_only", flaged, "")
print(df_out)

Try this.

Coelll
  • 65
  • 1
  • 10
0

Another approach (using the df_out you already have):

def fn(x):
    truth_idx = truth_df["Reference"].eq(x["Reference"]).idxmax()
    if x["Value1"] == truth_df.at[truth_idx, "Value1"]:
        return "Value2"
    return "Value1"


m = df_out["Issues"] == "Flag"
df_out.loc[m, "Issues"] = df_out.loc[m, ["Reference", "Value1"]].apply(
    fn,
    axis=1,
)
print(df_out)

Prints:

  Reference Value1 Value2  Issues
0         A      U      u        
1         A      U    u--  Value2
2         A    U--      u  Value1
3         B      V      v        
4         C      W      w        
5         C    W--      w  Value1
6         D      X      x        
7         E      Y      y        
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91