I have a dataframe with various columns and would like to compute the mean of each group, under the condition that each group has a minimum number of valid (non-NaN) members. I tried the following using groupby, filter, and mean. It seems to work, but I am wondering if there is a more efficient solution?
import pandas as pd
import numpy as np
df = pd.DataFrame({'id' : ['one', 'one', 'two', 'three', 'two',
'two', 'two', 'one', 'three', 'one'],
'idprop' : [1., 1., 2., 3., 2., # property corresponding to id
2., 2., 1., 3., 1.],
'x' : np.random.randn(10),
'y' : np.random.randn(10)})
# set a couple of x values to nan
s = df['x'].values
s[s < -0.6] = np.nan
df['x'] = s
g = df.groupby('id', sort=False)
# filter out small group(s) with less than 3 valid values in x
# result is a new dataframe
dff = g.filter(lambda d: d['x'].count() >= 3)
# this means we must group again to obtain the mean value of each filtered group
result = dff.groupby('id').mean()
print(result)
print(type(result))
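For comparison, here is one alternative I am considering: aggregate the counts and the means in a single groupby pass, then select rows by count, so the data is only grouped once. This sketch assumes pandas >= 0.25 for named aggregation; I seed NumPy and drop idprop only to keep it short and reproducible:

```python
import pandas as pd
import numpy as np

# rebuild the example data with a fixed seed so results are reproducible
np.random.seed(0)
df = pd.DataFrame({'id': ['one', 'one', 'two', 'three', 'two',
                          'two', 'two', 'one', 'three', 'one'],
                   'x': np.random.randn(10),
                   'y': np.random.randn(10)})
# set a couple of x values to nan, as before
s = df['x'].values
s[s < -0.6] = np.nan
df['x'] = s

# one groupby pass: count valid x values and compute both means together
agg = df.groupby('id', sort=False).agg(
    n=('x', 'count'), x=('x', 'mean'), y=('y', 'mean'))
# keep only groups with at least 3 valid x values, then drop the count column
result2 = agg.loc[agg['n'] >= 3, ['x', 'y']]
print(result2)
```

With this seed, the group 'three' (2 valid values) is dropped while 'one' and 'two' survive; the selection on `agg['n']` is a cheap boolean index on the already-aggregated frame rather than a second pass over the raw rows.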
There is a related question, "How to get multiple conditional operations after a Pandas groupby?", which, however, only "filters" by row values, not by the number of group elements. Translated to my code, this would be:
res2 = g.agg({'x': lambda d: df.loc[d.index, 'x'][d >= -0.6].sum()})
As a side question: is there a more efficient way to set values below or above a given threshold to NaN? My brain got twisted when I tried this using loc.
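For concreteness, here is a small made-up example of the two forms I was trying to get right; as far as I can tell, boolean indexing with loc and Series.mask are equivalent and avoid the detour through the NumPy array:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [0.5, -1.2, 0.1, -0.8, 2.0]})

# option 1: boolean indexing with loc, assigning NaN in place
df1 = df.copy()
df1.loc[df1['x'] < -0.6, 'x'] = np.nan

# option 2: Series.mask replaces values where the condition is True
df2 = df.copy()
df2['x'] = df2['x'].mask(df2['x'] < -0.6)
```

Both replace the two values below -0.6 with NaN and leave the rest untouched.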