Let's say I have the following DataFrame:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(data    = [1,np.nan,np.nan,1,1,np.nan,1,1,1], 
                   columns = ['X'], 
                   index   = ['a', 'a', 'a', 
                              'b', 'b', 'b',
                              'c', 'c', 'c'])
print(df1)
     X
a  1.0
a  NaN
a  NaN
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0

I want to keep only the index labels which have 2 or more non-NaN entries. In this case, the 'a' rows have only one non-NaN value, so I want to drop them and have my result be:

     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0

What is the best way to do this? Ideally I want something that works with Dask too, although usually if it works with Pandas it also works in Dask.

hm8

6 Answers


Let us try `filter`:

out = df1.groupby(level=0).filter(lambda x: x['X'].isna().sum() <= 1)
     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0
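
If the group sizes can vary, it may be safer to count non-NaN values directly, matching how the question phrases the requirement. A minimal sketch (equivalent here, since every group has exactly 3 rows):

# keep groups with 2 or more non-NaN values in X
out = df1.groupby(level=0).filter(lambda x: x['X'].notna().sum() >= 2)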

Or we can do it with `isin`:

df1[df1.index.isin(df1.isna().groupby(level=0).sum().loc[lambda x: x['X'] <= 1].index)]
     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0
BENY
  • Both of these work in Pandas, however it appears to be the rare case (at least in my experience) where it does not work in Dask. Neither the `filter` function nor the level-based summing appears to be implemented in Dask. I'll mark the answer as a solution though, since it mostly answered the question as posed. I originally chose to ask about Pandas instead of Dask since people typically have more experience with Pandas, and again the solutions are usually interchangeable... – hm8 Jan 05 '21 at 01:17

As another option, let's try filtering via GroupBy.transform and boolean indexing:

df1[df1['X'].isna().groupby(df1.index).transform('sum') <= 1]

     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0
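
For intuition, transform broadcasts each group's NaN count back to every row, so the comparison yields one boolean per row; roughly:

counts = df1['X'].isna().groupby(df1.index).transform('sum')
# a rows -> 2, b rows -> 1, c rows -> 0
mask = counts <= 1   # False for the 'a' rows, True for 'b' and 'c'
df1[mask]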

Or, almost the same way,

df1[df1.assign(X=df1['X'].isna()).groupby(level=0)['X'].transform('sum') <= 1]

     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0

You might have a good shot at getting this to work with Dask too.

cs95
  • Both of these work in Pandas, but result in the following error for Dask: `ValueError: Not all divisions are known, can't align partitions. Please use set_index to set the index.` – hm8 Jan 05 '21 at 01:25

I am new to Dask, but after looking at some examples and docs, the following seems to work:

from dask import dataframe as dd

sd = dd.from_pandas(df1, npartitions=3)

# convert X to booleans with isna(), then group by the index and sum
s = sd.X.isna().groupby(sd.index).sum().compute()

# boolean-index the counts to find labels with fewer than 2 NaNs, then use loc
out_dd = sd.loc[list(s[s < 2].index)]

out_dd.head(6, npartitions=-1)

     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0
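
If this pattern gets reused, it could be wrapped in a small helper; drop_sparse_groups below is a hypothetical name, with the threshold phrased the way the question states it (at least min_non_nan non-NaN entries per label):

def drop_sparse_groups(ddf, col='X', min_non_nan=2):
    # count non-NaN entries of `col` per index label (small pandas result)
    counts = ddf[col].notnull().groupby(ddf.index).sum().compute()
    # keep only the labels that meet the threshold
    keep = counts[counts >= min_non_nan].index
    return ddf.loc[list(keep)]

out_dd = drop_sparse_groups(sd)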
anky
  • Thank you for posting this. I was able to use some of your code to test out my code on Dask, and your code works perfectly. – Joe Ferndz Jan 10 '21 at 10:03
  • Since OP never responded, I'll award the bounty to this answer since it was the first one posted after the bounty was set. – cs95 Jan 14 '21 at 09:36
  • @cs95 Thank you :) thanks to the bounty I got a chance to look at Dask for the first time, so thank you. – anky Jan 14 '21 at 10:00

Here is another way:

df1.loc[df1.groupby(df1.index)['X'].apply(lambda x: x.notnull().sum() > 1)]

     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0
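
For reference, the groupby-apply step on its own returns one boolean per index label, which .loc then aligns against the duplicated index:

keep = df1.groupby(df1.index)['X'].apply(lambda x: x.notnull().sum() > 1)
# a    False
# b     True
# c     True
# Name: X, dtype: bool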
theDBA

I am new to Dask. I don't even have Dask installed on my laptop. I read through the documentation of Dask and found that Dask can do reset_index().

If that's allowed, here's how to approach the problem.

Step 1:

df2 = df1.reset_index()

df2 will give you:

>>> df2
  index    X
0     a  1.0
1     a  NaN
2     a  NaN
3     b  1.0
4     b  1.0
5     b  NaN
6     c  1.0
7     c  1.0
8     c  1.0

Now you have the index and value of X.

Step 2:

To find out which index values have fewer than 2 nulls, you can do:

df2.X.isnull().groupby([df2['index']]).sum().astype(int) < 2

The result of this will be:

index
a    False
b     True
c     True
Name: X, dtype: bool

Step 3:

You now apply this back to the original dataframe df1, and the filtered records will be those with fewer than 2 NaNs.

df1.loc[(df2.X.isnull().groupby([df2['index']]).sum().astype(int) < 2)]

The result of this will be:

     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0

I hope Dask allows you to do this. If it does, this will be a way to get the result.
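
For reference, a minimal Dask sketch of the same three steps, assuming the groupby-on-a-column pattern from the earlier Dask answer carries over (untested here, per the caveat above):

from dask import dataframe as dd

ddf = dd.from_pandas(df1, npartitions=3)

# Step 1: turn the labels into an ordinary column named 'index'
flat = ddf.reset_index()

# Step 2: count NaNs per label; compute() returns a small pandas Series
nan_counts = flat.X.isnull().groupby(flat['index']).sum().compute()

# Step 3: keep only the labels with fewer than 2 NaNs
keep = nan_counts[nan_counts < 2].index
out = ddf.loc[list(keep)].compute()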

Joe Ferndz
  • Very bold of you to attempt an answer for a library you don't even have installed – cs95 Jan 10 '21 at 09:29
  • Installed Dask, tried it out, and it works. That's nice. Looking at @anky's solution, my solution looks similar. Anky is using the list of the index to get similar output – Joe Ferndz Jan 10 '21 at 10:01
  • I ended up almost making my Dask code look like Anky's, so I would go with Anky's code – Joe Ferndz Jan 10 '21 at 10:17

You can use loc with a Series of booleans:

df1.loc[df1['X'].notna().groupby(level=0).sum().ge(2)]

In the first step we get the Series for filtering:

mask = df1['X'].notna().groupby(level=0).sum().ge(2)

Result:

a    False
b     True
c     True
Name: X, dtype: bool

In the second step we filter using loc:

df1.loc[mask]

Result:

     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0
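
Note that this works even though mask has one entry per label while df1 has three rows per label: .loc aligns the boolean Series with the frame's index. The broadcast can be made explicit, roughly:

# roughly the alignment .loc performs with an unaligned boolean mask
expanded = mask.reindex(df1.index)   # one boolean per row of df1
df1[expanded]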
Mykola Zotko