Find mismatch in all columns between two rows linked by another column in dataframe

Question

I have a large DataFrame with many rows and columns. It is used for reconciliation, so there are usually two rows per identifier. Is there a way to streamline identifying which non-identifier columns cause a mismatch between the rows that share an identifier?

import pandas as pd

df = pd.DataFrame({'col_1':      ['A', 'B', 'C', 'B', 'C', 'D', 'E'],
                   'identifier': [  1,   2,   3,   2,   3,   4,   4],
                   'col_3':      [ 10,  20,  30,  21,  31,  40,  41],
                   'col_4':      [  1,   1,   1,   1,   1,   1,   1],
                   })

In the df above, the mismatches would be:

col_1 for identifier 4 (D vs. E)
col_3 for identifier 2/3/4 (20 vs. 21, 30 vs. 31, 40 vs. 41)

Open to any representation that makes it easy to isolate the columns causing mismatch, their values and identifiers.

score 2 · Accepted Answer · answered Jun 09 '22 at 19:49

IIUC, you can aggregate the columns as sets and keep those with more than one element:

s = df.groupby('identifier').agg(set).stack()
out = s[s.str.len().gt(1)]

output:

identifier       
2           col_3    {20, 21}
3           col_3    {30, 31}
4           col_1      {D, E}
            col_3    {40, 41}
dtype: object
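
To make the intermediate steps visible, here is a minimal self-contained sketch of the same approach, assuming the sample df from the question. The names `sets` and `mismatched_pairs` are illustrative only and do not appear in the answer.

```python
import pandas as pd

df = pd.DataFrame({'col_1':      ['A', 'B', 'C', 'B', 'C', 'D', 'E'],
                   'identifier': [1, 2, 3, 2, 3, 4, 4],
                   'col_3':      [10, 20, 30, 21, 31, 40, 41],
                   'col_4':      [1, 1, 1, 1, 1, 1, 1]})

# Step 1: one row per identifier; each cell holds the set of distinct values
# seen in that column for that identifier.
sets = df.groupby('identifier').agg(set)

# Step 2: stack into a Series indexed by (identifier, column name).
s = sets.stack()

# Step 3: a set with more than one element means the rows disagree.
out = s[s.str.len().gt(1)]

# Illustrative helper: the (identifier, column) pairs that mismatch.
mismatched_pairs = list(out.index)
print(mismatched_pairs)
```

Columns that always agree (col_4 here) and identifiers with a single row (identifier 1) drop out on their own, because their sets never hold more than one element.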

further aggregation:

out.reset_index(level=1)['level_1'].groupby(level=0).agg(list)

output:

identifier
2           [col_3]
3           [col_3]
4    [col_1, col_3]
Name: level_1, dtype: object

answered Jun 09 '22 at 19:49

mozway


I really like the first output, maybe `out.swaplevel(-1, -2).sort_index()` would sort things a little more in line with OP's question? – BeRT2me Jun 09 '22 at 19:59

@BeRT2me OP's expected output was unclear, I assumed the "primary key" was the identifier – mozway Jun 09 '22 at 20:08

For the second output one can invert with `out.reset_index(level=0)['identifier'].groupby(level=0).agg(list)` – mozway Jun 09 '22 at 20:09

Both are possibilities based on OP's lack of details, but you're right that identifier makes more sense as the "p-key". – BeRT2me Jun 09 '22 at 20:12

Both solutions are excellent. This one also excludes columns where there is a perfect match, which was also an (unexpressed) intention. Modified the original description. – learning_python_self Jun 09 '22 at 20:17
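
The two column-first views discussed in the comments above can be sketched as follows, again assuming the sample df from the question. The names `by_column` and `ids_per_column` are illustrative. Note that index level 0 is named `identifier` by the groupby, so after `reset_index(level=0)` the column is selected by that name.

```python
import pandas as pd

df = pd.DataFrame({'col_1':      ['A', 'B', 'C', 'B', 'C', 'D', 'E'],
                   'identifier': [1, 2, 3, 2, 3, 4, 4],
                   'col_3':      [10, 20, 30, 21, 31, 40, 41],
                   'col_4':      [1, 1, 1, 1, 1, 1, 1]})

# Same starting point as the accepted answer.
s = df.groupby('identifier').agg(set).stack()
out = s[s.str.len().gt(1)]

# View 1: put the column name first, then sort, so mismatches group by column.
by_column = out.swaplevel(-1, -2).sort_index()

# View 2: for each column, the list of identifiers that mismatch on it.
ids_per_column = (out.reset_index(level=0)['identifier']
                     .groupby(level=0)
                     .agg(list))

print(by_column)
print(ids_per_column)
```

Which view is more convenient depends on whether the reconciliation is reviewed identifier by identifier or column by column.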

score 1 · Answer 2 · answered Jun 09 '22 at 19:49

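import numpy as np

# Replace single-value sets (no mismatch) with NaN so dropna() can discard them.
# Note: on pandas 2.1+, DataFrame.applymap is deprecated in favour of DataFrame.map.
mismatch = df.groupby('identifier').agg(set).applymap(lambda x: x if len(x) > 1 else np.nan)
col_1_mismatch = mismatch[['col_1']].dropna()
col_3_mismatch = mismatch[['col_3']].dropna()
print(col_1_mismatch)
print(col_3_mismatch)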

Output:

             col_1
identifier
4           {D, E}


               col_3
identifier
2           {20, 21}
3           {30, 31}
4           {40, 41}
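
With many columns, naming each one by hand does not scale. Below is a rough sketch of the same idea looped over every non-identifier column, assuming the sample df from the question. The dictionary name `mismatch_by_column` is hypothetical, and `Series.map(len)` is used here in place of `applymap`.

```python
import pandas as pd

df = pd.DataFrame({'col_1':      ['A', 'B', 'C', 'B', 'C', 'D', 'E'],
                   'identifier': [1, 2, 3, 2, 3, 4, 4],
                   'col_3':      [10, 20, 30, 21, 31, 40, 41],
                   'col_4':      [1, 1, 1, 1, 1, 1, 1]})

# One row per identifier, each cell a set of the distinct values seen.
sets = df.groupby('identifier').agg(set)

# Build one small frame per column, keeping only identifiers whose set
# holds more than one value; columns with no mismatch are skipped.
mismatch_by_column = {}
for col in sets.columns:
    multi = sets[col][sets[col].map(len) > 1]
    if not multi.empty:
        mismatch_by_column[col] = multi.to_frame()

for col, frame in mismatch_by_column.items():
    print(frame)
```

Columns that never disagree, such as col_4, simply do not appear as keys.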
