
I am an R user learning how to use Python's dfply, the Python equivalent to R's dplyr. My problem: in dfply, I am unable to mask on multiple conditions in a pipe. I seek a solution involving dfply pipes rather than multiple lines of subsetting.

My code:

# Import
import pandas as pd
import numpy as np
from dfply import *

# Create data frame and mask it
df  = pd.DataFrame({'a':[np.nan,2,3,4,5],'b':[6,7,8,9,np.nan],'c':[5,4,3,2,1]})
df2 = (df >>
        mask((X.a.isnull()) | ~(X.b.isnull())))
print(df)
print(df2)

Here is the original data frame, df:

       a    b    c
    0  NaN  6.0  5
    1  2.0  7.0  4
    2  3.0  8.0  3
    3  4.0  9.0  2
    4  5.0  NaN  1

And here is the result of the piped mask, df2:

         a    b    c
      0  NaN  6.0  5
      4  5.0  NaN  1

However, I expect this instead:

         a    b    c
      0  NaN  6.0  5
      1  2.0  7.0  4
      2  3.0  8.0  3
      3  4.0  9.0  2

Why don't the "|" and "~" operators return the rows in which either column "a" is NaN or column "b" is not NaN?

By the way, I also tried np.logical_or():

df  = pd.DataFrame({'a':[np.nan,2,3,4,5],'b':[6,7,8,9,np.nan],'c':[5,4,3,2,1]})
df2 = (df >>
        mask(np.logical_or(X.a.isnull(),~X.b.isnull())))
print(df)
print(df2)

But this resulted in an error:

mask(np.logical_or(X.a.isnull(),~X.b.isnull())))
ValueError: invalid __array_struct__
Neko

2 Answers


Edit: Change the second condition from "~(X.b.isnull())" to "X.b.notnull()". No idea why the tilde is ignored after the pipe.

df  = pd.DataFrame({'a':[np.nan,2,3,4,5],'b':[6,7,8,9,np.nan],'c':[5,4,3,2,1]})
df2 = (df >> mask((X.a.isnull()) | (X.b.notnull())))

print(df2)

     a    b  c
0  NaN  6.0  5
1  2.0  7.0  4
2  3.0  8.0  3
3  4.0  9.0  2
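
As a sanity check, here is a minimal sketch in plain pandas (no dfply involved) showing that the original expression is logically sound: `~isnull()`, `notnull()` and `np.logical_or` all build the same boolean mask on ordinary Series. That suggests the difference comes from how dfply treats `~` on `X` expressions. As far as I can tell, dfply also uses `~` to invert column selection in `select`, which may be related, but that part is an assumption rather than something verified here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 2, 3, 4, 5],
                   'b': [6, 7, 8, 9, np.nan],
                   'c': [5, 4, 3, 2, 1]})

# Three spellings of "a is NaN OR b is not NaN" on plain pandas Series
mask_tilde = df['a'].isnull() | ~df['b'].isnull()
mask_notnull = df['a'].isnull() | df['b'].notnull()
mask_numpy = np.logical_or(df['a'].isnull(), ~df['b'].isnull())

# All three masks are identical outside of a dfply pipe
assert mask_tilde.equals(mask_notnull)
assert mask_tilde.equals(mask_numpy)

# Boolean indexing keeps the rows where the mask is True
result = df[mask_tilde]
print(result)
```

With plain boolean indexing this keeps rows 0 through 3, which is the result the question expects.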
CurlyW
  • Can you please explain your answer? – DaFois Jul 25 '18 at 15:00
  • Thank you very much CurlyW. That works. But do you know why this original logic doesn't work...: `df2 = (df >> mask((X.a.isnull()) | ~(X.b.isnull())))` ...? It seems `dfply` ignores the `~` after `|`. I'd like to be confident that I can apply standard boolean operators in a dfply mask, since the equivalent of notnull() may not always exist. By the way, I tried `or`, but this resulted in an error. – Neko Jul 26 '18 at 11:41
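
For conditions that have no ready-made opposite like `notnull()`, a boolean mask can be negated without `~` at all. The sketch below shows three such spellings on plain pandas Series. Whether each one passes through dfply's `X` unchanged inside `mask` is an assumption I have not tested, so treat it as something to try rather than a confirmed fix.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 2, 3, 4, 5],
                   'b': [6, 7, 8, 9, np.nan],
                   'c': [5, 4, 3, 2, 1]})

is_null_b = df['b'].isnull()

# Three ways to negate a boolean Series without the ~ operator
negated_eq = is_null_b == False       # element-wise comparison with False
negated_ne = is_null_b != True        # element-wise "not equal to True"
negated_np = np.logical_not(is_null_b)  # NumPy ufunc applied to the Series

# Each one matches the built-in opposite, notnull()
expected = df['b'].notnull()
assert negated_eq.equals(expected)
assert negated_ne.equals(expected)
assert negated_np.equals(expected)

# Combine with the first condition as in the question
result = df[df['a'].isnull() | negated_eq]
print(result)
```

The comparison forms rely only on `==` and `!=`, so they avoid `~` entirely and do not depend on a method like `notnull()` existing for the condition in question.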

How about filter_by?

df >> filter_by((X.a.isnull()) | (X.b.isnull()))
Frightera
  • Hi @loveactualry. The challenge here is that dfply isn't recognising the `~` on the right side of the OR (`|`) in my original code. In your suggestion, when I change `(X.b.isnull())` to `~(X.b.isnull())`, I still get only rows 0 (`NaN 6.0 5`) and 4 (`5.0 NaN 1`), which is not what I expect. Rather, I expect rows 0 to 3: `NaN 6.0 5`, `2.0 7.0 4`, `3.0 8.0 3`, `4.0 9.0 2`. – Neko Apr 06 '21 at 11:22
  • Oh, sorry... I'll try again. – loveactualry Apr 07 '21 at 02:00