
I have a large dataframe that I want to split at the rows where all columns are NaN or otherwise non-finite. I am looking for something similar to the post Drop rows of pandas dataframe that don't have finite values in certain variable(s), but rather than dropping those rows I'd like to split on them.

I am currently on pandas 0.16.0

dlwlrma
  • does `df[df.apply(lambda x: x.isnull().all(), axis=1)]` work? – EdChum Feb 23 '16 at 15:39
  • Also doesn't `df.dropna(how='all')` return you this? – EdChum Feb 23 '16 at 15:42
  • @EdChum absolutely perfect. Thank you. The dropna returns the dataframe without the NaNs, not the rows with the NaNs. – dlwlrma Feb 23 '16 at 15:42
  • Which worked, the first suggestion?, it will be slow for a large df, not sure if it's quicker to do `df.loc[df.index.difference(df.dropna(how='all').index)]` – EdChum Feb 23 '16 at 15:45

2 Answers


As @EdChum has pointed out

df[df.apply(lambda x: x.isnull().all(), axis=1)]

does the trick.
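Applied to the actual splitting the question asks about, here's a minimal sketch. It uses the vectorized `df.isnull().all(axis=1)` (equivalent to the `apply` version above, just faster) and groups on the cumulative sum of the mask; the `groupby` split is my own choice, not something from the answer:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3, 4, np.nan, 6],
                   'b': [1, np.nan, 3, 4, np.nan, 6]})

# Boolean mask of rows where every column is NaN
all_nan = df.isnull().all(axis=1)

# Each all-NaN row bumps the cumulative sum, so rows between
# separators share a group label; drop the separator rows themselves
pieces = [g for _, g in df[~all_nan].groupby(all_nan.cumsum())]
# pieces is a list of three sub-dataframes here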

dlwlrma

It'll be quicker to find the all-NaN rows of your df by calling index.difference with the index labels returned from dropna:

In [69]:
df = pd.DataFrame({'a':[0,np.NaN, 0], 'b':[np.NaN, np.NaN, 1]})
df = pd.concat([df]*10000, ignore_index=True)   

%timeit df[df.apply(lambda x: x.isnull().all(), axis=1)]
%timeit df.loc[df.index.difference(df.dropna(how='all').index)]

1 loops, best of 3: 2.82 s per loop
100 loops, best of 3: 8.95 ms per loop

You can see that for a 30k-row df, the latter method is much faster.
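As a runnable sketch of the faster approach (same toy data as the timing above): dropna(how='all') keeps every row with at least one non-NaN value, so the index labels it drops are exactly the all-NaN rows.

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, np.nan, 0], 'b': [np.nan, np.nan, 1]})
df = pd.concat([df] * 10000, ignore_index=True)

# Labels present in df but absent after dropna -> rows where every column is NaN
nan_labels = df.index.difference(df.dropna(how='all').index)
all_nan_df = df.loc[nan_labels]
# one all-NaN row per original 3-row block, i.e. 10000 rows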

EdChum