
I have a pandas dataframe with transaction data as below:

import pandas as pd

data = [['1', 5, 'cvs', '4567', '4/6/2020  12:56:40', 0, 0],
        ['1', 5, 'cvs', '4567', '4/6/2020  12:56:41', 1, 0],
        ['2', 6, 'walgreens', '7897', '4/9/2020  12:56:41', 0, 0],
        ['2', 7, 'target', '7897', '4/10/2020  12:56:41', 0, 0],
        ['2', 8, 'walmart', '9898', '4/8/2020  12:56:41', 0, 1],
        ['2', 6, 'walgreens', '7897', '4/9/2020  12:56:41', 1, 0]]

df = pd.DataFrame(data, columns = ['ID', 'AMOUNT','STORE','CARDNUMBER','PURCHASETIME','DULICATED_FLAG','REVERSAL_FLAG'])

I want to find and remove the duplicates. Ideally, the result would look like the table below:

ID AMOUNT STORE     CARDNUMBER PURCHASETIME         DUPLICATED_FLAG REVERSAL_FLAG
1    5     cvs         4567      4/6/2020  12:56:40     0             0   
1    5     cvs         4567      4/6/2020  12:56:41     1             0  
2    6     walgreens   7897      4/9/2020  12:56:41     0             0
2    6     walgreens   7897      4/9/2020  12:57:41     1             0
NightDog
  • Does this answer your question? [python pandas: Remove duplicates by columns A, keeping the row with the highest value in column B](https://stackoverflow.com/questions/12497402/python-pandas-remove-duplicates-by-columns-a-keeping-the-row-with-the-highest) – ConnerWithAnE Oct 14 '21 at 21:25
    What are the criteria for a row to count as a duplicate? I fail to see the pattern between the input data and the desired output. – RJ Adriaansen Oct 14 '21 at 21:44
  • It looks like `df.drop_duplicates(['ID', 'DUPLICATED_FLAG'])` @RJAdriaansen (if you correct the word "DULICATED_FLAG" to "DUPLICATED_FLAG") or potentially `df.sort_values('DUPLICATED_FLAG', ascending=False).drop_duplicates(['ID', 'DUPLICATED_FLAG']).sort_index()` – Henry Ecker Oct 14 '21 at 21:48
  • For a row to count as a duplicate, every column except `PURCHASETIME` and `DUPLICATED_FLAG` must be exactly the same, and `DUPLICATED_FLAG` must be opposite. @RJAdriaansen – NightDog Oct 15 '21 at 13:33
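The `drop_duplicates` one-liner suggested in the comment thread above can be sketched as follows. This is a minimal sketch, not the accepted answer: it rebuilds the example frame with the column name typo corrected, and assumes `(ID, DUPLICATED_FLAG)` is the intended de-duplication key.

```python
import pandas as pd

data = [['1', 5, 'cvs', '4567', '4/6/2020  12:56:40', 0, 0],
        ['1', 5, 'cvs', '4567', '4/6/2020  12:56:41', 1, 0],
        ['2', 6, 'walgreens', '7897', '4/9/2020  12:56:41', 0, 0],
        ['2', 7, 'target', '7897', '4/10/2020  12:56:41', 0, 0],
        ['2', 8, 'walmart', '9898', '4/8/2020  12:56:41', 0, 1],
        ['2', 6, 'walgreens', '7897', '4/9/2020  12:56:41', 1, 0]]
df = pd.DataFrame(data, columns=['ID', 'AMOUNT', 'STORE', 'CARDNUMBER',
                                 'PURCHASETIME', 'DUPLICATED_FLAG', 'REVERSAL_FLAG'])

# Keep the first row for each (ID, DUPLICATED_FLAG) combination.
result = df.drop_duplicates(['ID', 'DUPLICATED_FLAG'])
```

On this sample data, that keeps rows 0, 1, 2, and 5 — the same four rows as the desired table.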

1 Answer


Duplicates can be found with `duplicated`. Since you wish to keep all rows of each duplicate group, we need to set `keep=False`. As `duplicated` returns a boolean Series, we can use it to filter the dataframe:

import pandas as pd

data = [['1', 5, 'cvs', '4567', '4/6/2020  12:56:40', 0, 0],
        ['1', 5, 'cvs', '4567', '4/6/2020  12:56:41', 1, 0],
        ['2', 6, 'walgreens', '7897', '4/9/2020  12:56:41', 0, 0],
        ['2', 7, 'target', '7897', '4/10/2020  12:56:41', 0, 0],
        ['2', 8, 'walmart', '9898', '4/8/2020  12:56:41', 0, 1],
        ['2', 6, 'walgreens', '7897', '4/9/2020  12:56:41', 1, 0]]

df = pd.DataFrame(data, columns = ['ID', 'AMOUNT','STORE','CARDNUMBER','PURCHASETIME','DUPLICATED_FLAG','REVERSAL_FLAG'])
df = df[df.duplicated(subset=['ID','AMOUNT','STORE','CARDNUMBER','REVERSAL_FLAG'], keep=False)]

Result:

  ID AMOUNT STORE     CARDNUMBER PURCHASETIME        DUPLICATED_FLAG REVERSAL_FLAG
0  1      5 cvs       4567       4/6/2020  12:56:40  0               0
1  1      5 cvs       4567       4/6/2020  12:56:41  1               0
2  2      6 walgreens 7897       4/9/2020  12:56:41  0               0
5  2      6 walgreens 7897       4/9/2020  12:56:41  1               0
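If, after identifying the duplicate groups, you want to collapse each group to a single row rather than keep both copies, a sketch using the same subset of columns (an assumption carried over from the filter above) would be:

```python
import pandas as pd

data = [['1', 5, 'cvs', '4567', '4/6/2020  12:56:40', 0, 0],
        ['1', 5, 'cvs', '4567', '4/6/2020  12:56:41', 1, 0],
        ['2', 6, 'walgreens', '7897', '4/9/2020  12:56:41', 0, 0],
        ['2', 7, 'target', '7897', '4/10/2020  12:56:41', 0, 0],
        ['2', 8, 'walmart', '9898', '4/8/2020  12:56:41', 0, 1],
        ['2', 6, 'walgreens', '7897', '4/9/2020  12:56:41', 1, 0]]
df = pd.DataFrame(data, columns=['ID', 'AMOUNT', 'STORE', 'CARDNUMBER',
                                 'PURCHASETIME', 'DUPLICATED_FLAG', 'REVERSAL_FLAG'])

# Keep only the first occurrence of each duplicate group,
# using the same key columns as the boolean filter above.
deduped = df.drop_duplicates(
    subset=['ID', 'AMOUNT', 'STORE', 'CARDNUMBER', 'REVERSAL_FLAG'],
    keep='first')
```

This drops the second `cvs` row and the second `walgreens` row, leaving one row per group.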
RJ Adriaansen