6

Consider the following example:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B' : [12,10,-2,-4,-2,5,8,7],
                   'C' : [-5,5,-20,0,1,5,4,-4]})

df
Out[12]: 
     A   B   C
0  foo  12  -5
1  bar  10   5
2  foo  -2 -20
3  bar  -4   0
4  foo  -2   1
5  bar   5   5
6  foo   8   4
7  foo   7  -4

Here I need to compute, for each group in A, the sum of the elements in B for which C is non-negative (i.e. C >= 0, a condition based on another column), and vice versa for C.

However, my code below fails.

df.groupby('A').agg({'B': lambda x: x[x.C>0].sum(),
                     'C': lambda x: x[x.B>0].sum()})      

AttributeError: 'Series' object has no attribute 'B'

So it seems apply would be preferred (because apply sees the whole DataFrame, I think), but unfortunately I cannot pass a dictionary to apply. So I am stuck. Any ideas?
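
Indeed, apply hands each group over as a whole sub-DataFrame, while agg passes one column at a time; a quick check (illustrative only, using the df above):

df.groupby('A').apply(lambda g: list(g.columns))       # every group still carries A, B and C
df.groupby('A')['B'].agg(lambda x: type(x).__name__)   # each column arrives as a bare Series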

One not-so-pretty, not-so-efficient solution would be to create these conditional variables before running the groupby (sketched below), but I am sure this solution does not use the full potential of Pandas.
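
A minimal sketch of that workaround, using the df defined above (the helper column names B_cond and C_cond are mine):

df['B_cond'] = np.where(df['C'] >= 0, df['B'], np.nan)  # NaN where C < 0
df['C_cond'] = np.where(df['B'] >= 0, df['C'], np.nan)  # NaN where B < 0
df.groupby('A').agg({'B_cond': 'sum', 'C_cond': 'sum'})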

So, for instance, the expected output for the group bar and column B would be

+10 (indeed C equals 5 and is >= 0)
-4 (indeed C equals 0 and is >= 0)
+5 (indeed C equals 5 and is >= 0)
= 11

Another example: group foo and column B

NaN (indeed C equals -5, so I don't want to consider the 12 in B)
+ NaN (indeed C equals -20)
- 2 (indeed C equals 1, so it is non-negative)
+ 8 (indeed C equals 4)
+ NaN (indeed C equals -4)
= 6

Note that I use NaNs instead of zeros because a function other than sum (e.g. the median) would give wrong results if we were to put zeros.
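
For example, take the foo group's B column, where only -2 and 8 survive the C >= 0 condition:

pd.Series([np.nan, np.nan, -2, 8, np.nan]).median()  # 3.0 -- NaNs are ignored
pd.Series([0, 0, -2, 8, 0]).median()                 # 0.0 -- zeros distort the result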

In other words, this is a simple conditional sum where the condition is based on another column. Thanks!

ℕʘʘḆḽḘ

3 Answers

10

Another alternative is to precompute the values you will need before using groupby/agg:

import numpy as np
import pandas as pd

N = 1000
df = pd.DataFrame({'A' : np.random.choice(['foo', 'bar'], replace=True, size=(N,)),
                   'B' : np.random.randint(-10, 10, size=(N,)),
                   'C' : np.random.randint(-10, 10, size=(N,))})

def using_precomputation(df):
    # Zero out B where C is negative (and vice versa), then use the builtin sum
    df['B2'] = df['B'] * (df['C'] >= 0).astype(int)
    df['C2'] = df['C'] * (df['B'] >= 0).astype(int)
    result = df.groupby('A').agg({'B2': 'sum', 'C2': 'sum'})
    return result.rename(columns={'B2': 'B', 'C2': 'C'})
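
(Multiplying by the boolean mask fills the excluded entries with zeros rather than NaNs; that is fine for sum, but for an aggregation like median you would want a NaN-based mask instead, e.g. df['B'].where(df['C'] >= 0), per the question's remark.)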

Let's compare using_precomputation with using_index and using_apply:

def using_index(df):
    # x is the group's own column; df.loc[x.index, ...] fetches the *other*
    # column for the same rows. The labels end up swapped, hence the rename.
    result = df.groupby('A').agg({'B': lambda x: df.loc[x.index, 'C'][x >= 0].sum(),
                                  'C': lambda x: df.loc[x.index, 'B'][x >= 0].sum()})
    return result.rename(columns={'B': 'C', 'C': 'B'})

def my_func(row):
    # row is the entire sub-DataFrame for one group, so both columns are visible
    b = row[row.C >= 0].B.sum()
    c = row[row.B >= 0].C.sum()
    return pd.Series({'B': b, 'C': c})

def using_apply(df):
    return df.groupby('A').apply(my_func)

First, let's check that they all return the same result:

def is_equal(df, func1, func2):
    result1 = func1(df).sort_index(axis=1)
    result2 = func2(df).sort_index(axis=1)
    assert result1.equals(result2)
is_equal(df, using_precomputation, using_index)
is_equal(df, using_precomputation, using_apply)

Using the 1000-row DataFrame above:

In [83]: %timeit using_precomputation(df)
100 loops, best of 3: 2.45 ms per loop

In [84]: %timeit using_index(df)
100 loops, best of 3: 4.2 ms per loop

In [85]: %timeit using_apply(df)
100 loops, best of 3: 6.84 ms per loop

Why is using_precomputation faster?

Precomputation allows us to take advantage of fast vectorized arithmetic on entire columns and lets the aggregation function be the simple builtin sum. Builtin aggregators tend to be faster than custom aggregation functions such as the lambdas in using_index above (based on jezrael's solution).

Moreover, the less work you have to do on each little group, the better off you are performance-wise. Having to do double-indexing for each group hurts performance.

Another performance killer is using groupby/apply(func) where func returns a Series. This forms one Series for each row of the result and then forces Pandas to align and concatenate them all. Since the Series tend to be short and numerous, concatenating all these little Series tends to be slow. Again, you get the best performance out of Pandas/NumPy when performing vectorized operations on big arrays; looping through lots of tiny results kills performance.
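
Loosely speaking, when func returns a Series, groupby/apply builds something like the following behind the scenes (a rough sketch, not Pandas' actual code path):

pieces = {key: my_func(grp) for key, grp in df.groupby('A')}  # one small Series per group
result = pd.DataFrame(pieces).T                               # align and concatenate them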

unutbu
  • thank you, very instructive. Don't you mean this forms one Series for each COLUMN of the result? – ℕʘʘḆḽḘ Mar 31 '16 at 18:54
  • 1
    Re: *This forms one Series for each row of the result*: `groupby/agg(func)` calls `func` once for each group and `func` returns a Series. Loosely speaking, the Series gets laid out horizontally and then concatenated. So each Series corresponds to a row of the result returned by `groupby/agg(func)`. – unutbu Mar 31 '16 at 18:57
3

I think you can use the group's index to pick out the matching rows of the other column. Note the final rename: the trick computes each conditional sum under the other column's label (as pointed out in the comments), so the labels have to be swapped back:

print(df.groupby('A').agg({'B': lambda x: df.loc[x.index, 'C'][x >= 0].sum(),
                           'C': lambda x: df.loc[x.index, 'B'][x >= 0].sum()})
        .rename(columns={'B': 'C', 'C': 'B'}))

      B   C
A
bar  11  10
foo   6  -5

Easier to understand are custom functions that do the same as above. Inside agg, x is one group's column, and x.index identifies exactly which rows of df belong to that group, so df.loc[x.index, 'C'] pulls the matching C values:

def f(x):
    # x is the group's B column; fetch the matching C values via the index
    s = df.loc[x.index, 'C']
    return s[x >= 0].sum()

def f1(x):
    # x is the group's C column; fetch the matching B values via the index
    s = df.loc[x.index, 'B']
    return s[x >= 0].sum()

print(df.groupby('A').agg({'B': f, 'C': f1}).rename(columns={'B': 'C', 'C': 'B'}))

      B   C
A
bar  11  10
foo   6  -5

EDIT:

root's solution is very nice, but it can be simplified by returning a Series directly (with b under the B label and c under C, fixing the swap noted in the comments):

def my_func(row):
    b = row[row.C >= 0].B.sum()
    c = row[row.B >= 0].C.sum()
    return pd.Series({'B': b, 'C': c})

result = df.groupby('A').apply(my_func)
      B   C
A
bar  11  10
foo   6  -5
jezrael
  • thanks jezrael, but I don't think this works. I am conditioning on the **other variable** in the sum. – ℕʘʘḆḽḘ Mar 31 '16 at 15:10
  • thank you so much. Can you just explain the mechanism here? What does `df.loc[x.index, 'C'][df.loc[x.index,'C'] > 0]` exactly do? – ℕʘʘḆḽḘ Mar 31 '16 at 15:20
  • 1
    Is the output correct? For example, shouldn't (foo, 'B') be 6? Or am I misinterpreting the question? – root Mar 31 '16 at 15:30
  • yes, I think the output is not correct actually. Let me make the question clearer, but I believe the solution will be very close. – ℕʘʘḆḽḘ Mar 31 '16 at 15:31
  • 1
    thanks jezrael, If I am not mistaken, didnt you erroneously switch B and C in your second example? – ℕʘʘḆḽḘ Mar 31 '16 at 18:47
  • @jezrael In SQL we use `COUNT(DISTINCT CASE WHEN tb.A > tb.B THEN tb.C ELSE NULL END) AS cnt`; how can I achieve this? – mingchau Jun 18 '19 at 10:20
  • @mingchau - Without data it's not easy, but something like `out = df.loc[df.A > df.B, 'C'].nunique()` should work; if not, please post a new question with sample data, expected output and the code you tried. – jezrael Jun 18 '19 at 10:23
1

You can use apply to return a tuple containing the fields you want, and then use zip to unpack them.

def my_func(row):
    b = row[row.C >= 0].B.sum()
    c = row[row.B >= 0].C.sum()
    return b, c

# Perform the groupby aggregation.
result = df.groupby('A').apply(my_func).to_frame()

# Unpack the resulting tuple and drop the extra column.
result['B'], result['C'] = zip(*result[0])
result.drop(0, axis=1, inplace=True)

This yields the following output:

      B   C
A          
bar  11  10
foo   6  -5
root