Faster way of computing the mean with pandas groupby + apply and condensing groups

Question

I want to group by two columns and, if a group contains more than one row, return only the first row of that group with the value in c replaced by the group mean. If a group has only one row, I want to return it unchanged. My code looks like this:

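def condense(group):
    # Groups with several rows: keep the first row, overwrite c with the group mean.
    if group.shape[0] > 1:
        record = group.iloc[[0]].copy()  # copy avoids SettingWithCopyWarning
        record["c"] = group["c"].mean()
        return record
    # Single-row groups are returned unchanged.
    return group

final = df.groupby(["a", "b"]).apply(condense).drop(['a', 'b'], axis=1).reset_index()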

And the df looks something like this:

a      b     c   d
"f"   "e"    2   True
"f"   "e"    3   False
"c"   "a"    1   True

The data frame is quite large: there are 73800 groups, and the whole groupby + apply takes about a minute, which is far too long. Is there a way to make it run faster?

jezrael · Accepted Answer · 2020-10-19T12:37:58.637

The mean of a single value is that value itself, so single-row groups need no special handling. You can simplify the solution with GroupBy.agg, using mean for column c and first for all other columns:

d = dict.fromkeys(df.columns.difference(['a', 'b']), 'first')
d['c'] = 'mean'
print(d)
{'c': 'mean', 'd': 'first'}

df = df.groupby(["a", "b"], as_index=False).agg(d)
print(df)
   a  b    c     d
0  c  a  1.0  True
1  f  e  2.5  True
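
For anyone who wants to try this on the sample data from the question, here is a minimal self-contained sketch. The names agg_map, via_agg and via_transform are only illustrative. The second variant (transform followed by drop_duplicates) is not part of the answer above; it is an assumed alternative that keeps the first row of each group exactly as it is. That can matter because GroupBy.first picks the first non-null value per column, which is not always the value in the first row.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "a": ["f", "f", "c"],
    "b": ["e", "e", "a"],
    "c": [2, 3, 1],
    "d": [True, False, True],
})

# Aggregation map: 'first' for every non-key column, then 'mean' for c.
agg_map = dict.fromkeys(df.columns.difference(["a", "b"]), "first")
agg_map["c"] = "mean"
via_agg = df.groupby(["a", "b"], as_index=False).agg(agg_map)

# Assumed alternative (not from the answer): broadcast the group mean onto
# every row, then keep only the first row of each (a, b) group.
via_transform = (
    df.assign(c=df.groupby(["a", "b"])["c"].transform("mean"))
      .drop_duplicates(["a", "b"])
      .sort_values(["a", "b"])
      .reset_index(drop=True)
)

print(via_agg)
print(via_transform)
```

On this sample both variants give the same two rows. With missing values in the non-key columns they can differ, so which one to use depends on whether "first row" or "first non-null value" is wanted.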

edited Oct 19 '20 at 12:37

answered Oct 19 '20 at 12:31

jezrael

Wow, that just took the time from 58 seconds to 0.1, thanks! – LizzAlice Oct 19 '20 at 12:55
