Let's say I construct the following DataFrame in Dask:

import numpy as np
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame(data    = [1,np.nan,np.nan,1,1,np.nan,1,1,1], 
                   columns = ['X'], 
                   index   = ['a', 'a', 'a', 
                              'b', 'b', 'b',
                              'c', 'c', 'c'])

ddf = dd.from_pandas(pdf, npartitions = 1)

print(ddf.compute())
     X
a  1.0
a  NaN
a  NaN
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0

I want to keep only the index labels that have 2 or more non-NaN entries. In this case, the 'a' rows have only one non-NaN value, so I want to drop that label and have my result be:

     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0

What is the best way to do this?

NOTE: This is a follow-up to this post, which asks the same question for Pandas. The solutions proposed there work for Pandas, but not for Dask.
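
For illustration, a typical pandas-style approach to this kind of filtering is sketched below (a hypothetical example, not necessarily one of the linked solutions). It works in pandas but fails under Dask because, at the time of writing, dask.dataframe's groupby does not implement `filter`:

# Pandas only: keep groups with at least 2 non-NaN values.
# Series.count() counts non-NaN entries, so 'a' (count 1) is dropped.
pdf.groupby(level=0).filter(lambda g: g['X'].count() >= 2)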


1 Answer

This works in Dask:

# Count NaN values per index label
ddf1 = ddf.isna().groupby(ddf.index).sum()
# Keep only labels with at most one NaN (here, at least two non-NaN of the three rows)
ddf2 = ddf1.where(ddf1 <= 1).dropna()
# Select the surviving labels from the original frame (list() triggers a compute of ddf2)
ddf.loc[list(ddf2.index), :].compute()

Output:

     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0
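
One caveat worth noting: this keeps labels with at most one NaN, which matches "2 or more non-NaN entries" here only because every label has exactly three rows. Counting the non-NaN values directly avoids that assumption (a sketch using the same pattern as above):

# Count non-NaN entries per index label and keep labels with >= 2
counts = ddf.notnull().groupby(ddf.index).sum()
keep = counts.where(counts >= 2).dropna()
ddf.loc[list(keep.index), :].compute()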
  • The problem with this is the `list(ddf2.index)` step, which involves implicit computation of the Dask DataFrame. It's not really noticeable in this example, but I think it would significantly degrade performance on the large datasets that Dask is typically used for? (A fully lazy alternative is sketched after these comments.) – hm8 Jan 05 '21 at 16:35
  • Ok, I edited my answer to no longer use list(). Instead, it maps over each partition and checks whether its index values are in ddf2's index. – cosmic_inquiry Jan 05 '21 at 20:40
  • Ok, this works. But, for whatever reason, it runs *incredibly* slowly on my actual dataset (~100 CSV files) when `compute()` is called. About 10x slower than just using the list() method. – hm8 Jan 06 '21 at 16:47
  • So is the list method sufficient? It's not clear what you want. I've provided two ways to do this; are you saying that neither of them is fast enough? – cosmic_inquiry Jan 06 '21 at 17:22
  • 1
    Fair point, I think I need to reconsider my question. I'm realizing I might not understand how Dask works well enough to come up with a question that makes sense other than "I want my code to go faster". In my head Dask is this module that allows you to save up all of these computation steps and run them all at once at the end. But I suppose the more steps you give it before calling "compute", (such as what you've provided here), the longer its going to end up taking... The answer you provided does answer the question though, so I'll mark it as a solution. – hm8 Jan 06 '21 at 17:40
  • Thanks, and yeah sometimes you've got to play around with different methods if you really want things to be optimized. "Premature optimization is the root of all evil...". I rolled back to the list solution, since you said that was faster for you. – cosmic_inquiry Jan 06 '21 at 23:03
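
For completeness: the implicit compute hidden in `list(ddf2.index)` can also be avoided without `map_partitions` by joining on the index. This is a hedged sketch (not the rolled-back `map_partitions` code, which is no longer visible), reusing the NaN-counting logic from the answer; the `_nan_count` name is just a placeholder to avoid a column clash in the merge:

# Same NaN-counting logic as in the answer
nan_counts = ddf.isna().groupby(ddf.index).sum()
keep = nan_counts.where(nan_counts <= 1).dropna()

# Rename so the helper column does not clash with 'X' during the merge
keep = keep.rename(columns={'X': '_nan_count'})

# Inner join on the index keeps only rows whose label survives the filter;
# everything stays lazy until compute() is called.
result = dd.merge(ddf, keep, left_index=True, right_index=True, how='inner')[['X']]
print(result.compute())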