
I have a pandas DataFrame named Merged that has a column named RegimentalNumber.

I'm using the pandas.DataFrame.duplicated() method to retrieve the duplicates from this DataFrame like so:

In [16]: Merged[Merged.RegimentalNumber.duplicated() == True]

However, it looks like the result also includes missing values of RegimentalNumber as duplicates.

Does the duplicated() method take a flag or parameter to exclude missing values as duplicates? I took a look at the API Documentation for this method but could not find such a flag.

Of course I can then simply exclude the missing values like this:

In [17]: duplicates = Merged[Merged.RegimentalNumber.duplicated() == True]
In [18]: duplicates[duplicates.RegimentalNumber.notnull()]

However, it doesn't seem right to me that the duplicated() method also flags missing values as duplicates. Is there a simpler, one-step solution?

lostsoul29
`duplicated` should treat `NaN` as `False`, so you'll have to post raw data and code that reproduces what you're seeing – EdChum Apr 26 '16 at 19:57

1 Answer


You could use df.dropna() to ensure NULL values are ignored. For example,

import numpy as np
import pandas as pd

# A mix of np.nan and float('nan') entries; both count as missing values.
df = pd.DataFrame({'foo': [1, np.nan, 1, 2, 3, 2, 3, np.nan, float('nan'),
                           np.nan, float('nan'), 'xyz']})

# Drop the rows with missing 'foo' first, then select the duplicated values.
print(df.dropna().loc[df['foo'].duplicated()])

yields

  foo
2   1
5   2
6   3

Note that df['foo'].duplicated() is a boolean Series whose index may be larger than df.dropna().index. However, when you select rows using df.dropna().loc, the boolean Series index is reindexed to match df.dropna().index, so the NULL values conveniently get dropped.
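As a side note, the same result can be had in a single step by folding the notnull() check into the boolean mask itself. The sketch below reuses the Merged / RegimentalNumber names from the question with made-up sample data, since the real data isn't shown:

import numpy as np
import pandas as pd

# Stand-in for the asker's Merged DataFrame (made-up sample values).
Merged = pd.DataFrame({'RegimentalNumber': [101, np.nan, 101, 202, np.nan, 303, 202]})

# One-step selection: duplicated, non-missing RegimentalNumber values only.
# The notnull() term keeps the NaN rows out of the mask entirely.
duplicates = Merged[Merged.RegimentalNumber.duplicated()
                    & Merged.RegimentalNumber.notnull()]
print(duplicates)

Here the notnull() term filters the missing values inside the mask, so no separate dropna() call or second indexing step is needed.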

unutbu