I am an R user learning how to use Python's dfply, the Python equivalent to R's dplyr. My problem: in dfply, I am unable to mask on multiple conditions in a pipe. I seek a solution involving dfply pipes rather than multiple lines of subsetting.
My code:
# Import
import pandas as pd
import numpy as np
from dfply import *
# Create data frame and mask it
df = pd.DataFrame({'a':[np.nan,2,3,4,5],'b':[6,7,8,9,np.nan],'c':[5,4,3,2,1]})
df2 = (df >>
       mask((X.a.isnull()) | ~(X.b.isnull())))
print(df)
print(df2)
Here is the original data frame, df:
     a    b  c
0  NaN  6.0  5
1  2.0  7.0  4
2  3.0  8.0  3
3  4.0  9.0  2
4  5.0  NaN  1
And here is the result of the piped mask, df2:
     a    b  c
0  NaN  6.0  5
4  5.0  NaN  1
However, I expect this instead:
     a    b  c
0  NaN  6.0  5
1  2.0  7.0  4
2  3.0  8.0  3
3  4.0  9.0  2
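For reference, the same condition written with plain pandas boolean indexing returns the rows I expect. This is only a sketch without dfply (the name `expected` is mine), so it is a sanity check of the condition, not the piped solution I am after:

```python
import numpy as np
import pandas as pd

# Same data frame as in the dfply example
df = pd.DataFrame({'a': [np.nan, 2, 3, 4, 5],
                   'b': [6, 7, 8, 9, np.nan],
                   'c': [5, 4, 3, 2, 1]})

# Keep rows where "a" is NaN or "b" is not NaN, using plain boolean indexing
expected = df[df.a.isnull() | ~df.b.isnull()]
print(expected)
```

Here "|" and "~" act on ordinary boolean Series, so the difference seems to come from how they are handled on dfply's X inside the pipe.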
Why don't the "|" and "~" operators return the rows in which either column "a" is NaN or column "b" is not NaN?
By the way, I also tried np.logical_or():
df = pd.DataFrame({'a':[np.nan,2,3,4,5],'b':[6,7,8,9,np.nan],'c':[5,4,3,2,1]})
df2 = (df >>
       mask(np.logical_or(X.a.isnull(), ~X.b.isnull())))
print(df)
print(df2)
But this resulted in an error:
mask(np.logical_or(X.a.isnull(),~X.b.isnull())))
ValueError: invalid __array_struct__