I want to use the pandas GroupBy agg "computations/descriptive stats" functions instead of applying defined/custom or lambda function. I found the former method (e.g. df.groupby(groupid).agg({'col1':'mean'})
) is faster than later methods (e.g. df.groupby(groupid).agg(name='col1',lambda x:np.nanmean(x))
. However, I am having difficulty applying former method on multiple columns just as I needed.
Suppose I have pandas dataframe like this and I want to take (just example)
- number of transactions per user
- 2nd last date of transaction
- average amount of transaction
df = pd.DataFrame({"user_id":["u123"] * 5,
"order_date":["2021-01-01","2021-01-02","2021-01-03","2021-01-04","2021-01-04"],
"item_price":[123,234,321,345,0]})
I can use lambda functions within groupby.agg, for example,
df.groupby('user_id').agg(count=('user_id', lambda x: len(x)),
last2date=('order_date', lambda x: x.tail(2)[-1:]),
avgprice=('item_price', lambda x: np.nanmean(x))).reset_index()
However, I do not want to use lambda function but want to use the tail
or nth
function and not sure how to pass n
.
df.groupby(['user_id'])['order_date'].tail(2)[-1:] ## this works but I want agg on other columns too
df.groupby(['user_id']).agg({'user_id':'size', 'order_date':'nth'}) ## this fails how to pass n?
- I googled many sources but couldn't find solution to this particular situation.