Drop duplicates while preserving NaNs in pandas

Question

When using the drop_duplicates() method I reduce duplicates but also merge all NaNs into one entry. How can I drop duplicates while preserving rows with an empty entry (like np.nan, None or '')?

import pandas as pd
df = pd.DataFrame({'col':['one','two',np.nan,np.nan,np.nan,'two','two']})

Out[]: 
   col
0  one
1  two
2  NaN
3  NaN
4  NaN
5  two
6  two


df.drop_duplicates(['col'])

Out[]: 
   col
0  one
1  two
2  NaN

score 36 · Answer 1 · answered Mar 08 '16 at 06:22

36

Try

df[(~df.duplicated()) | (df['col'].isnull())]

The result is :

col
0   one
1   two
2   NaN
3   NaN     
4   NaN

answered Mar 08 '16 at 06:22

user666

5,231
2
26
35

9

for a very large dataset, `df=df[df['col'].isnull() | ~df[df['col'].notnull()].duplicated(subset='col',keep='first')]` turns out faster. – user2118915 Aug 13 '18 at 07:55
@user2118915 - If there are duplicate values in the index, you might want to `df.reset_index(inplace = True)` beforehand. – Pawel Kranzberg Sep 14 '18 at 10:57

score 2 · Answer 2 · answered May 07 '14 at 08:22

2

Well, one workaround that is not really beautiful is to first save the NaN and put them back in:

temp = df.iloc[pd.isnull(df).any(1).nonzero()[0]]
asd = df.drop_duplicates('col')
pd.merge(temp, asd, how='outer')
Out[81]: 
   col
0  one
1  two
2  NaN
3  NaN
4  NaN

answered May 07 '14 at 08:22

FooBar

15,724
19
82
171

Note that this only saved `NaN` and not other versions of "missing". What exactly you save depends on your data set. Typically, it does not make sense to have several different `missing data` indicators at the same time. – FooBar May 07 '14 at 08:29

score 0 · Answer 3 · answered Sep 13 '22 at 16:52

0

use:

df.drop_duplicates('col').append(df[df['col'].isna()])

answered Sep 13 '22 at 16:52

keramat

4,328
6
25
38

Drop duplicates while preserving NaNs in pandas

3 Answers3

Linked