
I have a pandas dataframe with transaction data as below:

import pandas as pd

data = [['1', 5, 'cvs', '4567', '4/6/2020  12:56:40', 0, 0],
        ['1', 5, 'cvs', '4567', '4/6/2020  12:56:41', 1, 0],
        ['2', 6, 'walgreens', '7897', '4/9/2020  12:56:41', 0, 0],
        ['2', 7, 'target', '7897', '4/10/2020  12:56:41', 0, 0],
        ['2', 8, 'walmart', '9898', '4/8/2020  12:56:41', 0, 1],
        ['2', 6, 'walgreens', '7897', '4/9/2020  12:56:41', 1, 0]]

df = pd.DataFrame(data, columns = ['ID', 'AMOUNT','STORE','CARDNUMBER','PURCHASETIME','DULICATED_FLAG','REVERSAL_FLAG'])

I want to find and remove the duplicates. Ideally, the result would look like the table below:

ID AMOUNT STORE     CARDNUMBER PURCHASETIME         DUPLICATED_FLAG REVERSAL_FLAG
1    5     cvs         4567      4/6/2020  12:56:40     0             0   
1    5     cvs         4567      4/6/2020  12:56:41     1             0  
2    6     walgreens   7897      4/9/2020  12:56:41     0             0
2    6     walgreens   7897      4/9/2020  12:57:41     1             0
NightDog
  • Does this answer your question? [python pandas: Remove duplicates by columns A, keeping the row with the highest value in column B](https://stackoverflow.com/questions/12497402/python-pandas-remove-duplicates-by-columns-a-keeping-the-row-with-the-highest) – ConnerWithAnE Oct 14 '21 at 21:25
    What are the criteria for a row to count as a duplicate? I fail to see the pattern between the input data and the desired output. – RJ Adriaansen Oct 14 '21 at 21:44
  • It looks like `df.drop_duplicates(['ID', 'DUPLICATED_FLAG'])` @RJAdriaansen (if you correct the word "DULICATED_FLAG" to "DUPLICATED_FLAG") or potentially `df.sort_values('DUPLICATED_FLAG', ascending=False).drop_duplicates(['ID', 'DUPLICATED_FLAG']).sort_index()` – Henry Ecker Oct 14 '21 at 21:48
  • For a row to count as a duplicate, every column except `PURCHASETIME` and `DUPLICATED_FLAG` must be exactly the same, and `DUPLICATED_FLAG` must be opposite. @RJAdriaansen – NightDog Oct 15 '21 at 13:33
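The `drop_duplicates` one-liner suggested in the comment thread above can be sketched as follows. This is a minimal sketch, not the accepted answer: it rebuilds the example frame with the column name typo corrected, and assumes `(ID, DUPLICATED_FLAG)` is the intended de-duplication key.

```python
import pandas as pd

data = [['1', 5, 'cvs', '4567', '4/6/2020  12:56:40', 0, 0],
        ['1', 5, 'cvs', '4567', '4/6/2020  12:56:41', 1, 0],
        ['2', 6, 'walgreens', '7897', '4/9/2020  12:56:41', 0, 0],
        ['2', 7, 'target', '7897', '4/10/2020  12:56:41', 0, 0],
        ['2', 8, 'walmart', '9898', '4/8/2020  12:56:41', 0, 1],
        ['2', 6, 'walgreens', '7897', '4/9/2020  12:56:41', 1, 0]]
df = pd.DataFrame(data, columns=['ID', 'AMOUNT', 'STORE', 'CARDNUMBER',
                                 'PURCHASETIME', 'DUPLICATED_FLAG', 'REVERSAL_FLAG'])

# Keep the first row for each (ID, DUPLICATED_FLAG) combination.
result = df.drop_duplicates(['ID', 'DUPLICATED_FLAG'])
```

On this sample data, that keeps rows 0, 1, 2, and 5 — the same four rows as the desired table.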

1 Answer


Duplicates can be found with `duplicated`. Since you wish to keep all rows of each duplicate group, we need to set `keep=False`. As `duplicated` returns a boolean Series, we can use it to filter the dataframe:

import pandas as pd

data = [['1', 5, 'cvs', '4567', '4/6/2020  12:56:40', 0, 0],
        ['1', 5, 'cvs', '4567', '4/6/2020  12:56:41', 1, 0],
        ['2', 6, 'walgreens', '7897', '4/9/2020  12:56:41', 0, 0],
        ['2', 7, 'target', '7897', '4/10/2020  12:56:41', 0, 0],
        ['2', 8, 'walmart', '9898', '4/8/2020  12:56:41', 0, 1],
        ['2', 6, 'walgreens', '7897', '4/9/2020  12:56:41', 1, 0]]

df = pd.DataFrame(data, columns = ['ID', 'AMOUNT','STORE','CARDNUMBER','PURCHASETIME','DUPLICATED_FLAG','REVERSAL_FLAG'])
df = df[df.duplicated(subset=['ID','AMOUNT','STORE','CARDNUMBER','REVERSAL_FLAG'], keep=False)]

Result:

  ID AMOUNT STORE     CARDNUMBER PURCHASETIME        DUPLICATED_FLAG REVERSAL_FLAG
0  1      5 cvs       4567       4/6/2020  12:56:40  0               0
1  1      5 cvs       4567       4/6/2020  12:56:41  1               0
2  2      6 walgreens 7897       4/9/2020  12:56:41  0               0
5  2      6 walgreens 7897       4/9/2020  12:56:41  1               0
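If, after identifying the duplicate groups, you want to collapse each group to a single row rather than keep both copies, a sketch using the same subset of columns (an assumption carried over from the filter above) would be:

```python
import pandas as pd

data = [['1', 5, 'cvs', '4567', '4/6/2020  12:56:40', 0, 0],
        ['1', 5, 'cvs', '4567', '4/6/2020  12:56:41', 1, 0],
        ['2', 6, 'walgreens', '7897', '4/9/2020  12:56:41', 0, 0],
        ['2', 7, 'target', '7897', '4/10/2020  12:56:41', 0, 0],
        ['2', 8, 'walmart', '9898', '4/8/2020  12:56:41', 0, 1],
        ['2', 6, 'walgreens', '7897', '4/9/2020  12:56:41', 1, 0]]
df = pd.DataFrame(data, columns=['ID', 'AMOUNT', 'STORE', 'CARDNUMBER',
                                 'PURCHASETIME', 'DUPLICATED_FLAG', 'REVERSAL_FLAG'])

# Keep only the first occurrence of each duplicate group,
# using the same key columns as the boolean filter above.
deduped = df.drop_duplicates(
    subset=['ID', 'AMOUNT', 'STORE', 'CARDNUMBER', 'REVERSAL_FLAG'],
    keep='first')
```

This drops the second `cvs` row and the second `walgreens` row, leaving one row per group.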
RJ Adriaansen