0

I have an ordered table with multiple ambiguous relationships between old and new contracts.

Old Contract New Contract
C1 C3
C1 C4
C2 C3
C2 C4

The goal is to have unambiguous relationships, i.e. each old contract should only be related to one new contract and vice versa.

Old Contract New Contract
C1 C3
C2 C4

I want to drop all rows where either 'Old Contract' OR 'New Contract' are a duplicate of an already kept instance. So, I want to keep the first row of course since it is the first instance. The second row should be dropped because C1 is then a duplicate. The third row should be dropped because C3 is a duplicate here. Finally, the last row should not be dropped although C2 and C4 were already seen before in the data but only in instances which were dropped before.

In my understanding, that cannot be solved with pandas drop_duplicates as a subset defines an AND-condition like 'drop only if duplicated in col1 and col2'.

I could only think of a for loop over the contracts and caching which old and new contracts were already kept.

Is there any less computing-intensive solution as I need to apply this to more than a million of data?

UPDATE: Thanks for the proposed solutions so far even though the problem is not finally solved. I want to extend the abstract example a bit to be more clear what constellations might be possible:

The following

Old Contract New Contract
C1 C3
C1 C4
C2 C3
C2 C5
C2 C4
C6 C4

should become

Old Contract New Contract
C1 C3
C2 C5
C6 C4
Jan
  • 1
  • 1
  • Maybe, drop_duplicates with col1 and then drop_duplicates with col2 ? – Rain.Wei Dec 03 '21 at 11:20
  • @Rain.Wei it will not work as he wants a xor and not an or – Learning is a mess Dec 03 '21 at 11:20
  • @Learningisamess Pardon me for my misunderstanding. – Rain.Wei Dec 03 '21 at 11:25
  • Why C1 = C3 if C1 actually has C4 as newer ? – Gedas Miksenas Dec 03 '21 at 12:18
  • Is this even possible to do without a loop? I mean, when you encounter the first row, all consecutive rows with those value should be removed before you can determine whether something in the next row is a duplicate or not. Because of that it seems to me that there has to be some tracking of what has been seen before. – eandklahn Dec 03 '21 at 13:47
  • @GedasMiksenas The relation C1 --> C3 is in the table before C1 --> C4. Thus, the preferred relation is C1 --> C3 – Jan Dec 03 '21 at 16:05
  • @E.Klahn I am afraid you could be right. Just hoped that it could be possible to find a better solution. – Jan Dec 03 '21 at 16:09
  • @Jan I agree it would be cleaner with something that didn't require iteration. But just out of curiosity, will you at any point have so much data that an iteration becomes a serious speed issue? – eandklahn Dec 08 '21 at 11:11
  • @E.Klahn I indeed have a scenario where I possibly would need to re-apply it weekly to 1-2m contract relations as the technical setup does not directly allow to store the results. Tested an iterative solution (for-loop) for 500k rows of contract relations with around 35k unique old contracts, and it took 2h. So, another point here is to adjust the technical setup so that I don't have to re-apply the logic over again to all the contracts. – Jan Dec 12 '21 at 18:02

0 Answers0