Dropping duplicates only if found twice

Question

I have a dataframe of claim numbers, each a 12-digit number. I am trying to remove reversed claims, which show up as 2 rows with the same claim number: the paid claim and its reversal. There are also cases where a claim was processed and reversed, but then reprocessed; these show up as 3 rows with the same claim number. I want to drop only the reversed claims, i.e. claim numbers that appear exactly twice. That would leave me with only paid claims and reprocessed claims. I am having trouble writing the drop_duplicates call in Python. When I do df.drop_duplicates(subset='claim_number', keep=False, inplace=True), I get rid of both the reversed claims and the reprocessed claims. Any help would be appreciated!

 In [2]: df
 Out[2]:
     A  
  0  207667742791  
  1  207667743011  
  2  207667743361
  3  207667743361
  4  214063686631
  5  214063686631
  6  214063686631

Desired Output:

In [2]: df
Out[2]:
     A  
  0  207667742791  
  1  207667743011  
  2  214063686631
  3  214063686631
  4  214063686631

Can you provide an example dataframe and the expected output? — JANO, Jan 21 '22 at 17:08

score 0 · Answer 1 · answered Jan 21 '22 at 20:05

You can group by your column and get the number of rows per group with transform('size') to build a mask. If a claim number appears exactly 2 times, its rows are dropped:

mask = df.groupby('A')['A'].transform('size').ne(2)
df2 = df[mask]

output:

              A
0  207667742791
1  207667743011
4  214063686631
5  214063686631
6  214063686631
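
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes the column is named 'A' as in the question's printout (the question's own call uses 'claim_number', so adjust accordingly), and the names group_size, result and alt are just illustrative. The reset_index(drop=True) step is an addition, there only to renumber the rows 0 to 4 like the desired output; the value_counts variant is one possible alternative way to build the same mask.

```python
import pandas as pd

# Sample data copied from the question; the column name 'A' is taken from its printout.
df = pd.DataFrame({"A": [207667742791, 207667743011,
                         207667743361, 207667743361,
                         214063686631, 214063686631, 214063686631]})

# Number of rows sharing each claim number, broadcast back onto every row.
group_size = df.groupby("A")["A"].transform("size")

# Keep every row whose claim number does NOT appear exactly twice,
# then renumber the index so it runs 0..n-1.
result = df[group_size.ne(2)].reset_index(drop=True)

# Alternative mask: look up each claim number's count from value_counts().
alt = df[df["A"].map(df["A"].value_counts()).ne(2)].reset_index(drop=True)

print(result)
```

Unlike drop_duplicates(keep=False), which removes every claim number that occurs more than once, this only removes the ones that occur exactly twice, so the claims with 3 rows survive.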
