consider the following example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B' : [12,10,-2,-4,-2,5,8,7],
'C' : [-5,5,-20,0,1,5,4,-4]})
df
Out[12]:
A B C
0 foo 12 -5
1 bar 10 5
2 foo -2 -20
3 bar -4 0
4 foo -2 1
5 bar 5 5
6 foo 8 4
7 foo 7 -4
Here I need to compute, for each group in A, the sum of elements in B conditional on C being non-negative (i.e. being >=0, a condition based on another column). And vice-versa for C.
However, my code below fails.
df.groupby('A').agg({'B': lambda x: x[x.C>0].sum(),
'C': lambda x: x[x.B>0].sum()})
AttributeError: 'Series' object has no attribute 'B'
So it seems apply
would be preferred (because apply sees all the dataframe I think), but unfortunately I cannot use a dictionary with apply
. So I am stuck. Any ideas?
One not-so-pretty not-so-efficient solution would be to create these conditional variables before running the groupby
, but I am sure this solution does not use the potential of Pandas
.
So, for instance, the expected output for the group bar
and column B
would be
+10 (indeed C equals 5 and is >=0)
-4 (indeed C equals 0 and is >=0)
+5 = 11
Another example:
group foo
and column B
NaN (indeed C equals -5 so I dont want to consider the 12 value in B)
+ NaN (indeed C= -20)
-2 (indeed C=1 so its positive)
+ 8
+NaN = 6
Remark that I use NaNs
instead of zero because another function than a sum would give wrong results (median) if we were to put zeros.
In other words, this is a simple conditional sum where the condition is based on another column. Thanks!