Given the following data:
data_df = pd.DataFrame({
"Reference": ("A", "A", "A", "B", "C", "C", "D", "E"),
"Value1": ("U", "U", "U--","V", "W", "W--", "X", "Y"),
"Value2": ("u", "u--", "u","v", "w", "w", "x", "y")
}, index=[1, 2, 3, 4, 5, 6, 7, 8])
truth_df = pd.DataFrame({
"Reference": ("A", "B", "C", "D", "E"),
"Value1": ("U", "V", "W", "X", "Y"),
"Value2": ("u", "v", "w", "x", "y")
}, index=[1, 4, 5, 7, 8])
data_df
Reference | Value1 | Value2 | |
---|---|---|---|
1 | A | U | u |
2 | A | U | u-- |
3 | A | U-- | u |
4 | B | V | v |
5 | C | W | w |
6 | C | W-- | w |
7 | D | X | x |
8 | E | Y | y |
truth_df
Reference | Value1 | Value2 | |
---|---|---|---|
1 | A | U | u |
4 | B | V | v |
5 | C | W | w |
7 | D | X | x |
8 | E | Y | y |
The code below merges both and flags the rows where the values don't match from truth_df
df_out = data_df.merge(truth_df, on=['Reference', 'Value1','Value2'], how='left', indicator=True).rename(columns={"_merge":"Issues"})
df_out["Issues"] = np.where(df_out["Issues"] == "left_only", "Flag", "")
df_out
Yields the following dataframe
df_out
Reference | Value1 | Value2 | Issues | |
---|---|---|---|---|
0 | A | U | u | |
1 | A | U | u-- | Flag |
2 | A | U-- | u | Flag |
3 | B | V | v | |
4 | C | W | w | |
5 | C | W-- | w | Flag |
6 | D | X | x | |
7 | E | Y | y |
I would like to know if there is a way to identify which columns are in conflict, so instead of displaying Flag
I'll display the column name like Value1
and Value2
respectively.
desired df
Reference | Value1 | Value2 | Issues | |
---|---|---|---|---|
0 | A | U | u | |
1 | A | U | u-- | Value2 |
2 | A | U-- | u | Value1 |
3 | B | V | v | |
4 | C | W | w | |
5 | C | W-- | w | Value1 |
6 | D | X | x | |
7 | E | Y | y |
Please note that even though the Reference
column looks useful, it is not a reliable source in my real data, so any solutions must not use that column.