1

Comparing values from a DataFrame agains another

Given the following data

data_df = pd.DataFrame({"Reference": ("A", "A", "A", "B", "C", "C", "D", "E"), "Other col": ("U", "U", "U--","V", "W", "W--", "X", "Y")}, index=[1, 2, 3, 4, 5, 6, 7, 8])

truth_df = pd.DataFrame({"Reference": ("A", "B", "C", "D", "E"), "Other col": ("U", "V", "W", "X", "Y")}, index=[1, 4, 5, 7, 8])

data_df

Reference Value
1 A U
2 A U
3 A Ux
4 B V
5 C W
6 C Ww
7 D X
8 E Y

truth_df

Reference Value
1 A U
4 B V
5 C W
7 D X
8 E Y

I need to check and flag that the values in data_df match that of truth_df and hopefully end up with a new data set like:

result_df

Reference Value Issues
1 A U
2 A U
3 A Ux Wrong
4 B V
5 C W
6 C Ww Wrong
7 D X
Ricardo Sanchez
  • 4,935
  • 11
  • 56
  • 86

2 Answers2

2

You can can create index by key column in both DataFrames with set_index() and then reindex() for same indices (columns names are same, so it's possible to compare). For example:

df2 = df2.set_index('key')
df1 = df1.set_index('key').reindex(df2.index)
print(df1)

then compare:


df3 = df1 != df2 
print(df3)
Feline
  • 773
  • 3
  • 14
  • How about if all I need is to compare a set of my `df` against one row from `df_st`? this set will be the matching `References` so I would like to compare all matching `References` against its matching `Ref` in `df_st` `Address` and `PostCode` hope that makes sense – Ricardo Sanchez May 06 '21 at 21:32
  • I have change the post to make it more clear, hopefully – Ricardo Sanchez May 07 '21 at 11:17
1

You can merge the dataframes with parameter indicator= set to True. If the indicator will be left_only means "Wrong" value in Issue:

df_out = data_df.merge(
    truth_df, on=["Reference", "Other col"], how="left", indicator=True
).rename(columns={"_merge": "Issues"})

df_out["Issues"] = np.where(df_out["Issues"] == "left_only", "Wrong", "")

print(df_out)

Prints:

  Reference Other col Issues
0         A         U       
1         A         U       
2         A       U--  Wrong
3         B         V       
4         C         W       
5         C       W--  Wrong
6         D         X       
7         E         Y       
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91