df.duplicated() false positives?

Question

I have a dataframe that holds 2,865,044 entries with a 3-level MultiIndex

MultiIndex.levels.names = ['year', 'country', 'productcode']

I am trying to reshape the dataframe to produce a wide dataframe but I am getting the error:

ReshapeError: Index contains duplicate entries, cannot reshape

I have used:

data[data.duplicated()]

to identify the lines causing the error but the data that it lists doesn't seem to contain any duplicates.

This led me to export my dataframe with to_csv(), open the data in Stata, and run the duplicates list command, which reports that the dataset holds no duplicates.

An Example from the sorted csv file:

year country productcode duplicate
1962    MYS     711       FALSE
1962    MYS     712       TRUE
1962    MYS     721       FALSE

I know it's a long shot, but any ideas what might be causing this? The data types of the index levels are ['year': int, 'country': str, 'productcode': str]. Could it be how pandas defines the unique groups? Are there any better ways to list the offending index lines?

Update: I have tried resetting the index

temp = data.reset_index()
dup = temp[temp.duplicated(cols=['year', 'country', 'productcode'])]

and I get a completely different list!

year    country productcode
1994      HKG      9710
1994      USA      9710
1995      HKG      9710
1995      USA      9710
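
Note for readers on newer pandas: the cols keyword used above was later renamed to subset. Also, by default duplicated() flags only the second and later occurrences of a key, so the first member of each duplicate group is not listed. A minimal sketch, assuming a small made-up frame (the values below are hypothetical, not taken from the real dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for data.reset_index()
temp = pd.DataFrame({
    "year": [1994, 1994, 1994, 1995],
    "country": ["HKG", "HKG", "USA", "USA"],
    "productcode": ["9710", "9710", "9710", "9710"],
    "value": [1.0, 2.0, 3.0, 4.0],
})
keys = ["year", "country", "productcode"]

# Default keep="first": only the repeat occurrence is flagged
dup_later = temp[temp.duplicated(subset=keys)]

# keep=False: every member of a duplicate group is flagged
dup_all = temp[temp.duplicated(subset=keys, keep=False)]

print(dup_later)
print(dup_all)
```

Listing with keep=False makes it easier to compare the colliding rows side by side.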

Update 2 [28 June 2013]:

It appears to have been a strange memory issue during my IPython session. This morning's fresh instance seems to work fine and reshapes the data without any adjustments to yesterday's code! I will debug further if the issue returns and let you know. Does anyone know of a good debugger for IPython sessions?

Could you try data.index.get_duplicates() and paste what you get? The error you got was that the *Index* contained duplicates, not the rows. – TomAugspurger, Jun 27 '13 at 14:01
Thanks Tom ... I hadn't seen the get_duplicates() method on the index type. Helpful to know about! :) – sanguineturtle, Jun 28 '13 at 04:59
For the debugger, check out ipdb. It works great in the terminal and QtConsole, and I think support is coming for the notebook in the upcoming IPython 1.0 release. – TomAugspurger, Jun 28 '13 at 13:59
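
To illustrate the distinction raised in the first comment: DataFrame.duplicated() compares the values in the columns and ignores the index, so it can flag a row whose index entry is unique. This is one possible reason the flagged rows did not look like duplicates. A minimal sketch, assuming a single made-up value column (the data is hypothetical; Index.duplicated is available in later pandas versions than the one used in the question):

```python
import pandas as pd

# Hypothetical toy data: every index entry is unique,
# but two rows happen to hold the same column value.
idx = pd.MultiIndex.from_tuples(
    [(1962, "MYS", "711"), (1962, "MYS", "712"), (1962, "MYS", "721")],
    names=["year", "country", "productcode"],
)
data = pd.DataFrame({"value": [10.0, 10.0, 25.0]}, index=idx)

row_dups = data.duplicated()          # compares column values only
index_dups = data.index.duplicated()  # compares index entries only

print(row_dups.tolist())
print(index_dups.tolist())
```

Here the second row is flagged by data.duplicated() even though no index entry repeats, which is not the condition that the reshape error complains about.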

score 3 · Accepted Answer · answered Jun 27 '13 at 17:13

Perhaps try:

cleaned = df.reset_index().drop_duplicates(df.index.names)
cleaned.set_index(df.index.names, inplace=True)

I think there ought to be a duplicated method on the index, but there is not one yet:

https://github.com/pydata/pandas/issues/4060
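
For readers on later pandas versions: Index.duplicated (with a keep argument) does exist there, so the offending index entries can be listed and dropped without resetting the index. A rough sketch, assuming a made-up frame with one repeated index entry (the data and the value column are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with one repeated index entry
idx = pd.MultiIndex.from_tuples(
    [
        (1994, "HKG", "9710"),
        (1994, "HKG", "9710"),
        (1994, "USA", "9710"),
        (1995, "HKG", "9710"),
    ],
    names=["year", "country", "productcode"],
)
df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# List every row whose index entry occurs more than once
offending = df[df.index.duplicated(keep=False)]

# Keep the first occurrence of each index entry
cleaned = df[~df.index.duplicated(keep="first")]

# With a unique index, the reshape to wide format goes through
wide = cleaned["value"].unstack("productcode")

print(offending)
print(wide)
```

Which occurrence to keep is a judgment call about the data; inspecting offending first is probably safer than dropping rows blindly.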

Wes McKinney

Thanks Wes! Good approach I hadn't thought of. – sanguineturtle Jun 28 '13 at 05:08
