Quantile threshold/filter within pandas groupby

Question

I have one categorical variable and two numeric cols:

np.random.seed(123)    
df = pd.DataFrame({'group' : ['a']*10+['b']*10,
                   'var1' : np.random.randn(20),
                   'var2' : np.random.randint(10,size=20)})

I want to find, by group, the mean of var1 after filtering to constrict df to the top-quartile var2 values by group. That is, the threshold for each group would be:

thresh = df.groupby('group')['var2'].quantile(0.75)

Here's a loopy demonstration of what I want to end up with:

for group, frame in df.groupby('group'):
    print(frame[frame.var2 >= frame.var2.quantile(0.75)].var1.mean())
# -1.4713362407192072
# 0.15512098976530683

The result should be a DataFrame with (group, var) columns or a Series indexed by group. I believe the solution may involve .transform/.apply but am stuck there.

You want a dataframe of the group var and the calucated mean or want to create a new column with that value transformed — Bharath M Shetty, Sep 28 '17 at 16:58

score 5 · Answer 1 · edited Sep 28 '17 at 17:03

5

Are you looking for this ?

new = df.groupby('group').apply(lambda x : \
          x[x.var2>=x.var2.quantile(0.75)] \
          .var1.mean()).to_frame()

             0
group          
a     -1.471336
b      0.155121

edited Sep 28 '17 at 17:03

Brad Solomon

38,521
31
149
235

answered Sep 28 '17 at 17:00

Bharath M Shetty

30,075
6
57
108

score 1 · Accepted Answer · answered Sep 28 '17 at 17:02

Similar to @Bharathshetty's answer but making the function a little more flexible rather than incorporating it to a lambda:

def thresh_filter(obj, sort_var, tgt_var, q=0.75, stat='mean'):
    thresh = obj[sort_var].quantile(q=q)
    return getattr(obj[obj[sort_var] >= thresh][tgt_var], stat)()

print(df.groupby('group').apply(lambda x: thresh_filter(x, 'var2', 'var1')))
group
a   -1.47134
b    0.15512
dtype: float64

Quantile threshold/filter within pandas groupby

2 Answers2