Drop duplicates based on two columns with an OR condition

Question

I have an ordered table with multiple ambiguous relationships between old and new contracts.

Old Contract	New Contract
C1	C3
C1	C4
C2	C3
C2	C4

The goal is to have unambiguous relationships, i.e. each old contract should only be related to one new contract and vice versa.

Old Contract	New Contract
C1	C3
C2	C4

I want to drop all rows where either 'Old Contract' OR 'New Contract' are a duplicate of an already kept instance. So, I want to keep the first row of course since it is the first instance. The second row should be dropped because C1 is then a duplicate. The third row should be dropped because C3 is a duplicate here. Finally, the last row should not be dropped although C2 and C4 were already seen before in the data but only in instances which were dropped before.

In my understanding, that cannot be solved with pandas drop_duplicates as a subset defines an AND-condition like 'drop only if duplicated in col1 and col2'.

I could only think of a for loop over the contracts and caching which old and new contracts were already kept.

Is there any less computing-intensive solution as I need to apply this to more than a million of data?

UPDATE: Thanks for the proposed solutions so far even though the problem is not finally solved. I want to extend the abstract example a bit to be more clear what constellations might be possible:

The following

Old Contract	New Contract
C1	C3
C1	C4
C2	C3
C2	C5
C2	C4
C6	C4

should become

Old Contract	New Contract
C1	C3
C2	C5
C6	C4

Maybe, drop_duplicates with col1 and then drop_duplicates with col2 ? — Rain.Wei, Dec 03 '21 at 11:20
Is this even possible to do without a loop? I mean, when you encounter the first row, all consecutive rows with those value should be removed before you can determine whether something in the next row is a duplicate or not. Because of that it seems to me that there has to be some tracking of what has been seen before. — eandklahn, Dec 03 '21 at 13:47
@GedasMiksenas The relation C1 --> C3 is in the table before C1 --> C4. Thus, the preferred relation is C1 --> C3 — Jan, Dec 03 '21 at 16:05
@E.Klahn I am afraid you could be right. Just hoped that it could be possible to find a better solution. — Jan, Dec 03 '21 at 16:09
@Jan I agree it would be cleaner with something that didn't require iteration. But just out of curiosity, will you at any point have so much data that an iteration becomes a serious speed issue? — eandklahn, Dec 08 '21 at 11:11
@E.Klahn I indeed have a scenario where I possibly would need to re-apply it weekly to 1-2m contract relations as the technical setup does not directly allow to store the results. Tested an iterative solution (for-loop) for 500k rows of contract relations with around 35k unique old contracts, and it took 2h. So, another point here is to adjust the technical setup so that I don't have to re-apply the logic over again to all the contracts. — Jan, Dec 12 '21 at 18:02

Drop duplicates based on two columns with an OR condition

0 Answers0