32

When using the drop_duplicates() method I reduce duplicates but also merge all NaNs into one entry. How can I drop duplicates while preserving rows with an empty entry (like np.nan, None or '')?

import pandas as pd
df = pd.DataFrame({'col':['one','two',np.nan,np.nan,np.nan,'two','two']})

Out[]: 
   col
0  one
1  two
2  NaN
3  NaN
4  NaN
5  two
6  two


df.drop_duplicates(['col'])

Out[]: 
   col
0  one
1  two
2  NaN
Chris Martin
  • 30,334
  • 10
  • 78
  • 137
bioslime
  • 1,821
  • 6
  • 21
  • 27

3 Answers3

36

Try

df[(~df.duplicated()) | (df['col'].isnull())]

The result is :

col
0   one
1   two
2   NaN
3   NaN     
4   NaN
user666
  • 5,231
  • 2
  • 26
  • 35
  • 9
    for a very large dataset, `df=df[df['col'].isnull() | ~df[df['col'].notnull()].duplicated(subset='col',keep='first')]` turns out faster. – user2118915 Aug 13 '18 at 07:55
  • @user2118915 - If there are duplicate values in the index, you might want to `df.reset_index(inplace = True)` beforehand. – Pawel Kranzberg Sep 14 '18 at 10:57
2

Well, one workaround that is not really beautiful is to first save the NaN and put them back in:

temp = df.iloc[pd.isnull(df).any(1).nonzero()[0]]
asd = df.drop_duplicates('col')
pd.merge(temp, asd, how='outer')
Out[81]: 
   col
0  one
1  two
2  NaN
3  NaN
4  NaN
FooBar
  • 15,724
  • 19
  • 82
  • 171
  • Note that this only saved `NaN` and not other versions of "missing". What exactly you save depends on your data set. Typically, it does not make sense to have several different `missing data` indicators at the same time. – FooBar May 07 '14 at 08:29
0

use:

df.drop_duplicates('col').append(df[df['col'].isna()])
keramat
  • 4,328
  • 6
  • 25
  • 38