Let's say I construct the following DataFrame in Dask:
import numpy as np
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame(data=[1, np.nan, np.nan, 1, 1, np.nan, 1, 1, 1],
                   columns=['X'],
                   index=['a', 'a', 'a',
                          'b', 'b', 'b',
                          'c', 'c', 'c'])
ddf = dd.from_pandas(pdf, npartitions=1)
print(ddf.compute())
     X
a  1.0
a  NaN
a  NaN
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0
I want to keep only the index labels that have two or more non-NaN entries. In this case, the 'a' rows have only one non-NaN value, so I want to drop them and have my result be:
     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0
What is the best way to do this?
NOTE: This is a follow-up to this post, which asks the same question but for Pandas. The solutions proposed there work for Pandas, but not for Dask.