I have a dataframe with various columns and would like to compute the mean of each group, under the condition that each group has a minimum number of valid (non-NaN) members. I tried the following using groupby, filter, and mean. It seems to work, but I am wondering if there is a more efficient solution?
import pandas as pd
import numpy as np
df = pd.DataFrame({'id' : ['one', 'one', 'two', 'three', 'two',
'two', 'two', 'one', 'three', 'one'],
'idprop' : [1., 1., 2., 3., 2., # property corresponding to id
2., 2., 1., 3., 1.],
'x' : np.random.randn(10),
'y' : np.random.randn(10)})
# set a couple of x values to nan
s = df['x'].values
s[s < -0.6] = np.nan
df['x'] = s
g = df.groupby('id', sort=False)
# filter out small group(s) with less than 3 valid values in x
# result is a new dataframe
dff = g.filter(lambda d: d['x'].count() >= 3)
# this means we must group again to obtain the mean value of each filtered group
result = dff.groupby('id').mean()
print(result)
print(type(result))
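For comparison, here is one alternative I am considering: aggregate the counts and the means in a single groupby pass, then select rows by count, so the data is only grouped once. This sketch assumes pandas >= 0.25 for named aggregation; I seed NumPy and drop idprop only to keep it short and reproducible:

```python
import pandas as pd
import numpy as np

# rebuild the example data with a fixed seed so results are reproducible
np.random.seed(0)
df = pd.DataFrame({'id': ['one', 'one', 'two', 'three', 'two',
                          'two', 'two', 'one', 'three', 'one'],
                   'x': np.random.randn(10),
                   'y': np.random.randn(10)})
# set a couple of x values to nan, as before
s = df['x'].values
s[s < -0.6] = np.nan
df['x'] = s

# one groupby pass: count valid x values and compute both means together
agg = df.groupby('id', sort=False).agg(
    n=('x', 'count'), x=('x', 'mean'), y=('y', 'mean'))
# keep only groups with at least 3 valid x values, then drop the count column
result2 = agg.loc[agg['n'] >= 3, ['x', 'y']]
print(result2)
```

With this seed, the group 'three' (2 valid values) is dropped while 'one' and 'two' survive; the selection on `agg['n']` is a cheap boolean index on the already-aggregated frame rather than a second pass over the raw rows.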
There is a related question, "How to get multiple conditional operations after a Pandas groupby?", which, however, only "filters" by row values, not by the number of group elements. Translated to my code, this would be:
res2 = g.agg({'x': lambda d: df.loc[d.index, 'x'][d >= -0.6].sum()})
As a side question: is there a more efficient way to set values below or above a given threshold to NaN? My brain got twisted when I tried this using loc.
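For concreteness, here is a small made-up example of the two forms I was trying to get right; as far as I can tell, boolean indexing with loc and Series.mask are equivalent and avoid the detour through the NumPy array:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [0.5, -1.2, 0.1, -0.8, 2.0]})

# option 1: boolean indexing with loc, assigning NaN in place
df1 = df.copy()
df1.loc[df1['x'] < -0.6, 'x'] = np.nan

# option 2: Series.mask replaces values where the condition is True
df2 = df.copy()
df2['x'] = df2['x'].mask(df2['x'] < -0.6)
```

Both replace the two values below -0.6 with NaN and leave the rest untouched.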