I have an ordered table with multiple ambiguous relationships between old and new contracts.
Old Contract | New Contract |
---|---|
C1 | C3 |
C1 | C4 |
C2 | C3 |
C2 | C4 |
The goal is to have unambiguous relationships, i.e. each old contract should only be related to one new contract and vice versa.
Old Contract | New Contract |
---|---|
C1 | C3 |
C2 | C4 |
I want to drop all rows where either 'Old Contract' OR 'New Contract' are a duplicate of an already kept instance. So, I want to keep the first row of course since it is the first instance. The second row should be dropped because C1 is then a duplicate. The third row should be dropped because C3 is a duplicate here. Finally, the last row should not be dropped although C2 and C4 were already seen before in the data but only in instances which were dropped before.
In my understanding, that cannot be solved with pandas
drop_duplicates
as a subset defines an AND-condition like 'drop only if duplicated in col1 and col2'.
I could only think of a for
loop over the contracts and caching which old and new contracts were already kept.
Is there any less computing-intensive solution as I need to apply this to more than a million of data?
UPDATE: Thanks for the proposed solutions so far even though the problem is not finally solved. I want to extend the abstract example a bit to be more clear what constellations might be possible:
The following
Old Contract | New Contract |
---|---|
C1 | C3 |
C1 | C4 |
C2 | C3 |
C2 | C5 |
C2 | C4 |
C6 | C4 |
should become
Old Contract | New Contract |
---|---|
C1 | C3 |
C2 | C5 |
C6 | C4 |