Why does Python not drop all duplicates?

Question

I want to remove the duplicates for the columns 'head_x' and 'head_y' and the columns 'cost_x' and 'cost_y'.

This is my code:

df=df.astype(str)

df.drop_duplicates(subset={'head_x','head_y'}, keep=False, inplace=True)

df.drop_duplicates(subset={'cost_x','cost_y'}, keep=False, inplace=True)

print(df)

This is the output dataframe. As you can see, the first row is a duplicate on both subsets, so why is this row still there?

I do not just want to remove the first row but all duplicates. This is another output, where there is also a duplicate for Index/Node 6.

Please do not post images of data. This makes it more difficult for people to help you! Just paste it in as text. Take a look at this: https://stackoverflow.com/help/how-to-ask — Dave, May 06 '20 at 13:40

sygneto · Answer 1 · 2020-05-06T13:43:59.977

0

df=df.astype(str)

df = df.drop_duplicates(subset=['head_x','head_y'], keep=False)

df = df.drop_duplicates(subset=['cost_x','cost_y'], keep=False)

I assume that cost_x should be replaced with head_y; otherwise there are no duplicates.
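One thing to watch with these calls: as far as I know, drop_duplicates called with inplace=True modifies the frame and returns None, so its result should not be assigned back to df. A minimal sketch of the difference, using a small made-up frame (the name demo and its values are hypothetical, not the asker's data):

```python
import pandas as pd

# Hypothetical data for illustration only, not the asker's frame.
demo = pd.DataFrame({"head_x": ["2", "2", "5"], "head_y": ["3", "3", "7"]})

# With inplace=True the copy is modified in place and the call returns None.
returned = demo.copy().drop_duplicates(
    subset=["head_x", "head_y"], keep=False, inplace=True
)

# Without inplace, a new frame is returned and can safely be assigned.
deduped = demo.drop_duplicates(subset=["head_x", "head_y"], keep=False)

print(returned)  # None
print(deduped)   # only the row with head_x == "5" remains
```

So either assign the result and omit inplace, or use inplace=True without assigning.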

edited May 06 '20 at 13:43

answered May 06 '20 at 13:37


score 0 · Accepted Answer · answered May 06 '20 at 13:42

Take a look at the first 2 rows:

      head_x  cost_x  head_y  cost_y
Node
1          2       6       2       3
1          2       6       3       4

Start with head_x and head_y:

the first row has 2 and 2,
the second row has 2 and 3,

so these two pairs are different.

Then look at cost_x and cost_y:

the first row has 6 and 3,
the second row has 6 and 4,

so these two pairs are also different.

Conclusion: These 2 rows are not duplicates, taking into account both column subsets.
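The comparison above can be checked with DataFrame.duplicated, which flags the rows that drop_duplicates would remove. This is only a sketch that rebuilds the two quoted rows by hand. The last check, on head_x together with cost_x, is my assumption about a pairing under which these rows would count as duplicates; it is not something the question states.

```python
import pandas as pd

# The two rows quoted above, indexed by Node and cast to str as in the question.
df = pd.DataFrame(
    {"head_x": [2, 2], "cost_x": [6, 6], "head_y": [2, 3], "cost_y": [3, 4]},
    index=pd.Index([1, 1], name="Node"),
).astype(str)

# keep=False marks every member of a duplicate group, not just the repeats.
dup_heads = df.duplicated(subset=["head_x", "head_y"], keep=False)
dup_costs = df.duplicated(subset=["cost_x", "cost_y"], keep=False)

# Assumed alternative pairing: both columns taken from the "_x" side.
dup_x = df.duplicated(subset=["head_x", "cost_x"], keep=False)

print(dup_heads.tolist())  # [False, False]
print(dup_costs.tolist())  # [False, False]
print(dup_x.tolist())      # [True, True]
```

Note that duplicated compares whole rows across the chosen columns; it does not compare head_x against head_y within a single row, and it ignores the index.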
