
I am having issues using the groupby_object.agg() method with functions whose input parameters I want to change. Is there a resource listing the function names .agg() accepts, and how do I pass parameters to them?

See an example below:

import pandas as pd
import numpy as np

df = pd.DataFrame({'numbers': [1, 2, 3, 2, 1, 3],
                   'colors': ['red', 'white', 'blue', 'red', 'white', np.nan],
                   'weight': [10, 10, 20, 5, 10, 20]})

df['colors'].nunique() # Returns 3 as NaN is not counted
df['colors'].nunique(dropna=False) # Returns 4 as NaN is counted

When I then group by 'numbers', how can I pass the dropna=False parameter to nunique for the 'colors' column?

df.groupby('numbers').agg({'colors': 'nunique', 'weight': 'sum'})
sophocles

2 Answers


Though pandas has nice syntax for aggregating with dicts and NamedAggs, these can come at a huge efficiency cost. The reason is that instead of using the built-in groupby methods, which are optimized and/or implemented in Cython, any .agg(lambda x: ...) or .apply(lambda x: ...) takes a much slower path.
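For reference, the two syntaxes mentioned above look like this (a minimal sketch using the question's DataFrame; both produce the same result):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'numbers': [1, 2, 3, 2, 1, 3],
                   'colors': ['red', 'white', 'blue', 'red', 'white', np.nan],
                   'weight': [10, 10, 20, 5, 10, 20]})

# dict syntax: {column: aggregation}
out_dict = df.groupby('numbers').agg({'weight': 'sum'})

# named-aggregation syntax: output_name=pd.NamedAgg(column, aggfunc)
out_named = df.groupby('numbers').agg(
    weight=pd.NamedAgg(column='weight', aggfunc='sum'))
```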

What this means is that you should stick with the built-ins you can reference directly or by alias. Only as a last resort should you try to use a lambda:

In this particular case use

df.groupby('numbers')[['colors']].agg('nunique', dropna=False)

Avoid

df.groupby('numbers').agg({'colors': lambda x: x.nunique(dropna=False)})

This example shows that, while equivalent in output and a seemingly minor change, there are enormous consequences in terms of performance, especially as the number of groups becomes large.

import perfplot
import pandas as pd
import numpy as np

def built_in(df):
    return df.groupby('numbers')[['colors']].agg('nunique', dropna=False)

def apply(df):
    return df.groupby('numbers').agg({'colors': lambda x: x.nunique(dropna=False)})

perfplot.show(
    setup=lambda n: pd.DataFrame({'numbers': np.random.randint(0, n//10+1, n),
                                  'colors': np.random.choice([np.nan] + [*range(100)], n)}),
    kernels=[
        lambda df: built_in(df),
        lambda df: apply(df)],
    
    labels=['Built-In', 'Apply'],
    n_range=[2 ** k for k in range(1, 20)],
    equality_check=np.allclose,  
    xlabel='~N Groups'
)

[perfplot results: runtime of the built-in .agg vs the lambda .agg as the number of groups grows]


But what if you want to do multiple aggregations on different columns?

The .groupby() part of a groupby doesn't really do that much; it simply ensures the mapping is correct. So though unintuitive, it is still much faster to aggregate with the built-in separately and concatenate the results in the end than it is to agg with a simpler dict using a lambda.

Here is an example that also sums the weight column; splitting is still much faster, despite needing to join the results manually:

def built_in(df):
    return pd.concat([df.groupby('numbers')[['colors']].agg('nunique', dropna=False),
                      df.groupby('numbers')[['weight']].sum()], axis=1)

def apply(df):
    return df.groupby('numbers').agg({'colors': lambda x: x.nunique(dropna=False), 
                                      'weight': 'sum'})

perfplot.show(
    setup=lambda n: pd.DataFrame({'numbers': np.random.randint(0, n//10+1, n),
                                  'colors': np.random.choice([np.nan] + [*range(100)], n),
                                  'weight': np.random.normal(0,1,n)}),
    kernels=[
        lambda df: built_in(df),
        lambda df: apply(df)],
    
    labels=['Built-In', 'Apply'],
    n_range=[2 ** k for k in range(1, 20)],
    equality_check=np.allclose,  
    xlabel='~N Groups'
)

[perfplot results: runtime of the split built-in aggregations plus concat vs the dict .agg with a lambda]

ALollz
    This is a very good explanation, thank you for it. A question based on your answer: ```df.groupby('numbers')[['colors']].agg('nunique', dropna=False)``` means that the only column aggregated will be ```colors```, and I can change it to ```df.groupby('numbers')[['colors','weight']].agg('nunique', dropna=False)``` to get the unique counts of both columns. However, what if someone wants the ```max()``` of the weight column and the ```nunique``` of the colors column? – sophocles Feb 05 '21 at 18:57
    @sophocles ah yeah sorry, I was just adding that. You can look at my second example, where I split out the one groupby and then `concat` in the end. If efficiency isn't a huge issue (i.e. maybe you have a small df, not one with 20 million groups) then I'd say readability might be more important. But in many cases of larger data these slowdowns can be the difference between a 1 minute or 1 hour calculation, so it's often better to take the slightly more complicated coding path. – ALollz Feb 05 '21 at 18:59

You can use this link: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

and navigate to the "Aggregation" section, where you can find the different methods available on grouped data.
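If you just want to see which names .agg() will accept as strings, one quick (unofficial) check is to inspect the public methods of the groupby object itself, since string aliases resolve to these:

```python
import pandas as pd

df = pd.DataFrame({'numbers': [1, 2, 3, 2, 1, 3],
                   'weight': [10, 10, 20, 5, 10, 20]})

gb = df.groupby('numbers')
# Public attributes include the reducers usable as string aliases in .agg(),
# e.g. 'sum', 'mean', 'nunique', 'max', ...
methods = [m for m in dir(gb) if not m.startswith('_')]
```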

In your case, you can pass dropna=False if you use a lambda function:

df.groupby('numbers').agg({'colors': lambda x: x.nunique(dropna=False), 'weight': 'sum'})

Out[324]: 
         colors  weight
numbers                
1             2      20
2             2      15
3             2      40
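As a variation (not in the original answer), `functools.partial` can bind dropna=False once instead of writing a lambda inline, and gives the same result:

```python
from functools import partial

import pandas as pd
import numpy as np

df = pd.DataFrame({'numbers': [1, 2, 3, 2, 1, 3],
                   'colors': ['red', 'white', 'blue', 'red', 'white', np.nan],
                   'weight': [10, 10, 20, 5, 10, 20]})

# Pre-bind the keyword argument; .agg() accepts any callable
nunique_all = partial(pd.Series.nunique, dropna=False)

out = df.groupby('numbers').agg({'colors': nunique_all, 'weight': 'sum'})
```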
sophocles