Let's say I have the following DataFrame:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(data    = [1,np.nan,np.nan,1,1,np.nan,1,1,1], 
                   columns = ['X'], 
                   index   = ['a', 'a', 'a', 
                              'b', 'b', 'b',
                              'c', 'c', 'c'])
print(df1)
     X
a  1.0
a  NaN
a  NaN
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0

I want to keep only the index labels which have 2 or more non-NaN entries. In this case, the 'a' rows have only one non-NaN value, so I want to drop them and have my result be:

     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0

What is the best way to do this? Ideally I want something that works with Dask too, although usually if it works with Pandas it also works in Dask.

hm8

6 Answers


Let us try `filter`:

out = df1.groupby(level=0).filter(lambda x: x['X'].isna().sum() <= 1)
     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0
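
If the group sizes can vary, it may be safer to count non-NaN values directly, matching how the question phrases the requirement. A minimal sketch (equivalent here, since every group has exactly 3 rows):

# keep groups with 2 or more non-NaN values in X
out = df1.groupby(level=0).filter(lambda x: x['X'].notna().sum() >= 2)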

Or we can do it with `isin`:

df1[df1.index.isin(df1.isna().groupby(level=0).sum().loc[lambda x: x['X'] <= 1].index)]
     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0
BENY
  • Both of these work in Pandas, however it appears to be the rare case (at least in my experience) where it does not work in Dask. Neither the `filter` function nor the level-based summing appears to be implemented in Dask. I'll mark the answer as a solution though, since it mostly answered the question as posed. I originally chose to ask about Pandas instead of Dask since people typically have more experience with Pandas, and again the solutions are usually interchangeable... – hm8 Jan 05 '21 at 01:17

As another option, let's try filtering via GroupBy.transform and boolean indexing:

df1[df1['X'].isna().groupby(df1.index).transform('sum') <= 1]

     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0
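
For intuition, transform broadcasts each group's NaN count back to every row, so the comparison yields one boolean per row; roughly:

counts = df1['X'].isna().groupby(df1.index).transform('sum')
# a rows -> 2, b rows -> 1, c rows -> 0
mask = counts <= 1   # False for the 'a' rows, True for 'b' and 'c'
df1[mask]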

Or, almost the same way,

df1[df1.assign(X=df1['X'].isna()).groupby(level=0)['X'].transform('sum') <= 1]

     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0

You might have a good shot at getting this to work with Dask too.

cs95
  • Both of these work in Pandas, but result in the following error for Dask: `ValueError: Not all divisions are known, can't align partitions. Please use set_index to set the index.` – hm8 Jan 05 '21 at 01:25

I am new to Dask, but after looking at some examples and docs, the following seems to work:

from dask import dataframe as dd

sd = dd.from_pandas(df1, npartitions=3)

# convert X to booleans with isna(), then group by the index and sum
s = sd.X.isna().groupby(sd.index).sum().compute()

# boolean-index the counts to find labels with fewer than 2 NaNs, then use loc
out_dd = sd.loc[list(s[s < 2].index)]

out_dd.head(6, npartitions=-1)

     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0
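
If this pattern gets reused, it could be wrapped in a small helper; drop_sparse_groups below is a hypothetical name, with the threshold phrased the way the question states it (at least min_non_nan non-NaN entries per label):

def drop_sparse_groups(ddf, col='X', min_non_nan=2):
    # count non-NaN entries of `col` per index label (small pandas result)
    counts = ddf[col].notnull().groupby(ddf.index).sum().compute()
    # keep only the labels that meet the threshold
    keep = counts[counts >= min_non_nan].index
    return ddf.loc[list(keep)]

out_dd = drop_sparse_groups(sd)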
anky
  • Thank you for posting this. I was able to use some of your code to test out my code on Dask, and your code works perfectly. – Joe Ferndz Jan 10 '21 at 10:03
  • Since OP never responded, I'll award the bounty to this answer since it was the first one posted after the bounty was set. – cs95 Jan 14 '21 at 09:36
  • @cs95 Thank you :) thanks to the bounty I got a chance to look at Dask for the first time, so thank you. – anky Jan 14 '21 at 10:00

Here is another way:

df1.loc[df1.groupby(df1.index)['X'].apply(lambda x: x.notnull().sum() > 1)]

     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0
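
For reference, the groupby-apply step on its own returns one boolean per index label, which .loc then aligns against the duplicated index:

keep = df1.groupby(df1.index)['X'].apply(lambda x: x.notnull().sum() > 1)
# a    False
# b     True
# c     True
# Name: X, dtype: bool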
theDBA

I am new to Dask. I don't even have Dask installed on my laptop. I read through the documentation of Dask and found that Dask can do reset_index().

If that's allowed, here's how to approach the problem.

Step 1:

df2 = df1.reset_index()

df2 will give you:

>>> df2
  index    X
0     a  1.0
1     a  NaN
2     a  NaN
3     b  1.0
4     b  1.0
5     b  NaN
6     c  1.0
7     c  1.0
8     c  1.0

Now you have the index and value of X.

Step 2:

To find out which index values have fewer than 2 nulls, you can do:

df2.X.isnull().groupby([df2['index']]).sum().astype(int) < 2

The result of this will be:

index
a    False
b     True
c     True
Name: X, dtype: bool

Step 3:

You now apply this back to the original dataframe df1, and the filtered records will be those with fewer than 2 NaNs.

df1.loc[(df2.X.isnull().groupby([df2['index']]).sum().astype(int) < 2)]

The result of this will be:

     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0

I hope Dask allows you to do this. If it does, this will be a way to get the result.
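
For reference, a minimal Dask sketch of the same three steps, assuming the groupby-on-a-column pattern from the earlier Dask answer carries over (untested here, per the caveat above):

from dask import dataframe as dd

ddf = dd.from_pandas(df1, npartitions=3)

# Step 1: turn the labels into an ordinary column named 'index'
flat = ddf.reset_index()

# Step 2: count NaNs per label; compute() returns a small pandas Series
nan_counts = flat.X.isnull().groupby(flat['index']).sum().compute()

# Step 3: keep only the labels with fewer than 2 NaNs
keep = nan_counts[nan_counts < 2].index
out = ddf.loc[list(keep)].compute()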

Joe Ferndz
  • Very bold of you to attempt an answer for a library you don't even have installed – cs95 Jan 10 '21 at 09:29
  • Installed Dask, tried it out, and it works. That's nice. Looking at @anky's solution, my solution looks similar. Anky is using the list of the index to get similar output – Joe Ferndz Jan 10 '21 at 10:01
  • I ended up almost making my Dask code look like Anky's, so I would go with Anky's code – Joe Ferndz Jan 10 '21 at 10:17

You can use loc with a Series of booleans:

df1.loc[df1['X'].notna().groupby(level=0).sum().ge(2)]

In the first step we get the Series for filtering:

mask = df1['X'].notna().groupby(level=0).sum().ge(2)

Result:

a    False
b     True
c     True
Name: X, dtype: bool

In the second step we filter using loc:

df1.loc[mask]

Result:

     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0
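
Note that this works even though mask has one entry per label while df1 has three rows per label: .loc aligns the boolean Series with the frame's index. The broadcast can be made explicit, roughly:

# roughly the alignment .loc performs with an unaligned boolean mask
expanded = mask.reindex(df1.index)   # one boolean per row of df1
df1[expanded]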
Mykola Zotko