
I want to use the pandas GroupBy `agg` "computations/descriptive stats" functions instead of applying a defined/custom or lambda function. I found that the former method (e.g. `df.groupby(groupid).agg({'col1': 'mean'})`) is faster than the latter (e.g. `df.groupby(groupid).agg(name=('col1', lambda x: np.nanmean(x)))`). However, I am having difficulty applying the former method to multiple columns the way I need.

Suppose I have a pandas dataframe like this, and I want to compute (just as an example):

  1. number of transactions per user
  2. 2nd last date of transaction
  3. average amount of transaction
df = pd.DataFrame({"user_id":["u123"] * 5, 
                   "order_date":["2021-01-01","2021-01-02","2021-01-03","2021-01-04","2021-01-04"], 
                   "item_price":[123,234,321,345,0]})

I can use lambda functions within groupby.agg, for example,

df.groupby('user_id').agg(count=('user_id', lambda x: len(x)),
                          last2date=('order_date', lambda x: x.iloc[-2]),
                          avgprice=('item_price', lambda x: np.nanmean(x))).reset_index()

However, I do not want to use lambda functions; I want to use the `tail` or `nth` function instead, but I am not sure how to pass `n`.

df.groupby(['user_id'])['order_date'].tail(2)[-1:]  ## this works, but I want to agg other columns too
df.groupby(['user_id']).agg({'user_id': 'size', 'order_date': 'nth'})  ## this fails; how do I pass n?
  • I googled many sources but couldn't find a solution to this particular situation.
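
One workaround, sketched here under the assumption of pandas >= 2.0 (where `nth` acts as a filter, returning the selected rows with the grouping column kept as a column), is to keep the built-in string aggregations in `agg` and call `nth(-2)` directly on a reused groupby object, then join the results (`count`, `avgprice`, and `last2date` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({"user_id": ["u123"] * 5,
                   "order_date": ["2021-01-01", "2021-01-02", "2021-01-03",
                                  "2021-01-04", "2021-01-04"],
                   "item_price": [123, 234, 321, 345, 0]})

g = df.groupby("user_id")

# built-in string aggregations: no lambdas needed
stats = g.agg(count=("user_id", "size"),
              avgprice=("item_price", "mean"))

# nth accepts n when called as a method; in pandas >= 2.0 it is a filter
# that returns the selected rows with the grouping column kept
second_last = (g.nth(-2)
                 .set_index("user_id")["order_date"]
                 .rename("last2date"))

out = stats.join(second_last).reset_index()
print(out)
```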
  • https://stackoverflow.com/questions/66068515/how-do-you-change-input-parameters-of-pandas-groupby-agg-function. The solution is to basically separate the aggregations and `concat` the results. It's unintuitive, but the "groupby" part of the `groupby` doesn't really do anything, or take time, so it's fine to repeat that in favor of avoiding the slow lambda and allowing you to use defaults with specific aggregations. The specific example there uses `nunique` with the arg `dropna`, but is completely relevant to using `tail` with `n` here – ALollz Oct 21 '21 at 16:57
  • @ALollz - thanks, it makes sense that the built-in functions are faster. I am trying to use built-in functions but with arguments. I couldn't find any post that covers how to use built-in functions across multiple columns with arguments for each function :<. + yeah, I also thought about making separate groupbys and merging, but wanted to avoid that if possible :) – camel_case Oct 21 '21 at 17:00
  • Actually I'm under the impression that `df.groupby()` does do something, e.g. indexing. Creating a groupby object `g = df.groupby(...)` and calling `g.agg` several times might be a bit faster. – Quang Hoang Oct 21 '21 at 17:02
  • Well it does get complicated because the return of a `tail` aggregation doesn't reduce (i.e. you get a DataFrame) but things like `size` do reduce to a single scalar per group. So you will need to define _how_ to join those results back in the end, as you can imagine there are multiple possibilities – ALollz Oct 21 '21 at 17:02
  • 1
    @QuangHoang I generally thought it just checked the mapping was _okay_. Perhaps I'm missing something obvious, but if you `pd.DataFrame(np.random.randint(0,10000,10**4))` and then time `df.groupby(0)` you get essentially a few hundred us, even if you make the same DataFrame have 10**8 rows. So it does _something_ but not sure it scales in any meaningful way or eats up any major chunk of time – ALollz Oct 21 '21 at 17:06
  • 2
    It doesn't do much if you just measure the creation of the groupby object. But if yo call some methods, e.g. `mean`, the indexing will be stored inside the object. `df.groupby(0).mean();df.groupby(0).max()` is significantly slower than `g = df.groupby(0); g.mean(); g.max()`. – Quang Hoang Oct 21 '21 at 17:48
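
The two suggestions in the comments above — separating the aggregations and `concat`-ing the reduced results, and building the groupby object once so its indexing can be reused — can be combined in a sketch like this (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"user_id": ["u123"] * 5,
                   "order_date": ["2021-01-01", "2021-01-02", "2021-01-03",
                                  "2021-01-04", "2021-01-04"],
                   "item_price": [123, 234, 321, 345, 0]})

# build the groupby object once and reuse it for every aggregation
g = df.groupby("user_id")

counts = g.size().rename("count")
avg = g["item_price"].mean().rename("avgprice")

# tail(2) does not reduce: it keeps the original row index, so re-group
# the result by the aligned user_id and take the first of the last two dates
second_last = (g["order_date"].tail(2)
                 .groupby(df["user_id"]).first()
                 .rename("last2date"))

# the reduced Series all share the group key as index, so concat joins them
out = pd.concat([counts, avg, second_last], axis=1).reset_index()
print(out)
```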

0 Answers