I can't figure out why I can't remove duplicates from a Pandas df

Question

I am trying to update a Pandas DataFrame with data from an API and write it to a .csv file. I need to be sure it does not contain duplicate rows.

I have been checking on here to see what the problem might be (for example forgetting to add inplace=True), but this doesn't seem to be the case.

First, I have pandas read the csv:

df = pd.read_csv(file)

Then I download some more data from the API (I made sure it includes duplicate lines) and create df2. The csv was written by the same code, so I am sure that a duplicate line is exactly the same. Now I need to append one dataframe to the other and then drop the duplicates:

df = df.append(df2, ignore_index=True)
df.drop_duplicates(subset=None, keep='first', inplace=True)

Then I tried:

df = df.drop_duplicates()

I would expect not to see any duplicate rows with either approach, but I must be missing something, as they are still there and I can't figure out why. I did check whether someone else's question addressed this, but the problem there is normally a missing inplace=True, which is not my case.

You already tried `df = df.drop_duplicates()`? Or, `df = df.drop_duplicates().reset_index(drop=True)` What happens if you omit all the arguments except inplace? — Mark Moretto, Apr 13 '19 at 15:35
I tried it now, but to no avail... editing my question. Thanks. — Francesco Lini, Apr 13 '19 at 15:54
Ah, okay. I see you got an answer. I've never used the keep argument, but now I'm thinking I should! lol — Mark Moretto, Apr 13 '19 at 16:16

score 1 · Accepted Answer · answered Apr 13 '19 at 15:53


Is this what you need?

df.drop_duplicates(keep=False)
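
To make the difference between the `keep` options concrete, here is a minimal sketch. The two small frames are hypothetical stand-ins for the csv contents and the API data, not the asker's real data. It uses `pd.concat` instead of `df.append`, because `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical stand-ins for the CSV contents and the fresh API data.
df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
df2 = pd.DataFrame({"id": [3, 4], "value": ["c", "d"]})

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
combined = pd.concat([df, df2], ignore_index=True)

# keep='first' (the default): one copy of each duplicated row survives.
first_kept = combined.drop_duplicates(keep="first")

# keep=False: every row that has a duplicate is dropped, including the first.
none_kept = combined.drop_duplicates(keep=False)

print(first_kept["id"].tolist())  # [1, 2, 3, 4]
print(none_kept["id"].tolist())   # [1, 2, 4]
```

Note that `drop_duplicates` returns a new DataFrame unless `inplace=True` is passed, so the result has to be assigned (as above) for the change to stick.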


pythonjokeun


Indeed it was... I didn't understand that keep='first' meant to keep the first duplicate (I thought it meant to keep the first instance). Thanks a lot. – Francesco Lini Apr 13 '19 at 16:01
