I compare two DataFrames (df1
and df2
) each of which shall have unique rows for a given combination of keys. That means there are not duplicates for keys such as Col1
and Col2
.
import pandas as pd
# OK NOK NOK
df1 = pd.DataFrame({'Col1': ['A', 'A', 'B'],
'Col2': ['J', 'K', 'M'],
'Col3': ['1', '2', '3'],
'Col4': ['UA', 'RU', 'EU']
})
# OK NOK NOK single
df2 = pd.DataFrame({'Col1': ['A', 'A', 'B', 'C'],
'Col2': ['J', 'L', 'N', 'O'],
'Col3': ['1', '2', '3', '4'],
'Col4': ['UA', 'RU', 'non-EU', 'CN']
})
matching_key = ['Col1', 'Col3']
Firstly, I naively identify which rows are not matching
non = df1.merge(df2, indicator=True, how='outer').query('_merge != "both"')
non['_merge'] = (non['_merge'].str.replace('left_only', 'df1').str.replace('right_only', 'df2'))
non = pd.DataFrame({'Col1': ['A', 'A', 'B', 'B', 'C'],
'Col2': ['K', 'L', 'M', 'N', 'O'],
'Col3': ['2', '2', '3', '3', '4'],
'Col4': ['RU', 'RU', 'EU', 'non-EU', 'CN'],
'_merge': ['df1', 'df2', 'df1', 'df2', 'df2']
})
Then I want to find out what column/s cause this mis-match, considering the uniqueness keys (?).
So for instance, to know that Col2
causes mismatch with keys Col1 = 'A'
and Col3 = '2'
, and [Col2, Col4]
cause mismatch with keys Col1 = 'B'
and Col3 = '3'
. Something like
result = {['Col2']: pd.DataFrame({'Col1': ['A', 'A'],
'Col3': ['2', '2'],
'Col2': ['K', 'L']
}),
['Col2', 'Col4']: pd.DataFrame({'Col1': ['B', 'B'],
'Col3': ['3', '3'],
'Col2': ['M', 'N'],
'Col4': ['EU', 'non-EU']
})
}
I am open to any way that helps to quickly get an overview of columns causing mismatch.
One can then tackle issues from least complex (only one column) to more complex (more columns).