2

Say I have below dataframe:

A B C D E
3 2 1 4 5
3 2 1 2 3
4 5 6 7 8
4 5 6 9 8
9 3 8 5 4

I would like to drop duplicates based on columns A, B and C, keeping the rows for which column E is the highest. And if the values in column E are the same, then keeping the rows for which the column D is the highest.

So above dataframe would become:

A B C D E
3 2 1 4 5
4 5 6 9 8
9 3 8 5 4

I saw a beginning of answer there: python pandas: Remove duplicates by columns A, keeping the row with the highest value in column B but unfortunately I can't find out how to handle the if the values are the same in column E then keep the highest from column D :/

(I am running this code on a quite large dataset)

Any help appreciated !

Pauline
  • 92
  • 5

1 Answers1

1

you can sort the frame first according to the E, D criterion in descending order and then drop the duplicates:

df.sort_values(["E", "D"], ascending=[False, False]).drop_duplicates(subset=list("ABC"))
Mustafa Aydın
  • 17,645
  • 4
  • 15
  • 38