Python Pandas df.duplicated() false positives

Question

I am running into an issue where df.duplicated() erroneously returns True. When I reset the index (df.reset_index()), df.duplicated() returns the correct result.

This issue was raised in 2013; however, the cause was not identified, only a workaround. I am experiencing the problem now after reading data in from an SQL database. I would greatly appreciate a solution, as I don't want to have to resort to resetting the index of a DataFrame every time I need to run the .duplicated() method.

I get the following when I display the 'duplicates' using df[df.duplicated()]:

name        type  code 
John Doe    A     6532  
Jane Doe    A     1124 
Rudolph Doe B     3412

None of these are duplicated. After I perform df.reset_index() I get completely different (and correct) results.

I'm quite confused and have scoured the Internet for a solution. I appreciate any help one could provide.

I'm using the latest Pandas (0.19.1) release. However, I tried this with 0.18 and had the same problem.

How do you know that none of these are duplicates? Are you aware that the default for the keep argument of .duplicated is 'first', which does not mark the first occurrence as True? So if a row appears only twice, only the second occurrence will be returned. — schlump, Nov 16 '16 at 20:18
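To illustrate the point about keep, here is a minimal sketch on made-up data (the rows below are hypothetical, not the asker's actual DataFrame). With the default keep='first', only the later copy of a repeated row is flagged, so the flagged rows can look unique when viewed on their own; keep=False flags every copy.

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 0 exactly.
df = pd.DataFrame({
    "name": ["John Doe", "Jane Doe", "John Doe"],
    "type": ["A", "A", "A"],
    "code": [6532, 1124, 6532],
})

# Default keep='first': the first occurrence is NOT marked, later ones are.
first = df.duplicated()

# keep=False: every member of a duplicate group is marked.
all_marked = df.duplicated(keep=False)

print(df[first])       # only the second "John Doe" row
print(df[all_marked])  # both "John Doe" rows
```

Viewing df[df.duplicated(keep=False)] is usually the quicker way to check whether a flagged row really has a twin elsewhere in the frame.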
You really need to show both df and df.duplicated() if you expect anyone to be able to help you. That said, note that the index does NOT factor into the calculation of duplicated. But after reset_index, the index becomes a regular column and DOES factor into the calculation. So it is absolutely expected that reset_index would make a difference (and if the index is unique, then nothing will be a duplicate after reset_index, solely because the index itself is unique). — JohnE, Nov 17 '16 at 00:57
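The effect of the index described above can be reproduced with a small sketch. The data and index labels are hypothetical; the only assumption is a DataFrame with a unique index and one repeated row.

```python
import pandas as pd

# Hypothetical data with a unique, non-default index; row 12 repeats row 10.
df = pd.DataFrame(
    {
        "name": ["John Doe", "Jane Doe", "John Doe"],
        "type": ["A", "A", "A"],
        "code": [6532, 1124, 6532],
    },
    index=[10, 11, 12],
)

# The index is ignored: only the column values are compared.
before = df.duplicated()

# reset_index() turns the old index into a regular column named "index".
# That column is unique, so no row can be a full duplicate any more.
after = df.reset_index().duplicated()

# drop=True discards the old index instead, so the result matches `before`.
after_drop = df.reset_index(drop=True).duplicated()

print(before.tolist())      # [False, False, True]
print(after.tolist())       # [False, False, False]
print(after_drop.tolist())  # [False, False, True]
```

If the aim is to compare only certain columns, passing subset=[...] to duplicated() is an alternative to reshaping the index.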

score 0 · Answer 1 · answered Nov 18 '16 at 00:19

One stick of my RAM died today. Once I replaced it, this problem went away. I'm assuming the faulty RAM was the cause, as I have had no issues since replacing it.

Thank you for the comments and attempts to help. I really appreciate it.
