Pandas 0.25 supports "named aggregation" via the agg function and namedtuples: you pass (column, aggregator) pairs, as the docs describe. The docs also say:

If your aggregation functions require additional arguments, partially apply them with functools.partial().
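
For example, this works fine when the extra argument is a constant (a minimal sketch with made-up data, not taken from the docs):

import functools
import numpy as np
import pandas as pd

df = pd.DataFrame({'g': ['x', 'x', 'y'], 'v': [1, 2, 3]})

# partial fixes q=75, so agg calls np.percentile(series, q=75)
# on each group's 'v' values
df.groupby('g').agg(P75=('v', functools.partial(np.percentile, q=75)))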

I would like to apply this principle to compute a weighted average (in addition to a simple count and average). My input table is

import pandas as pd

t = pd.DataFrame({'bucket':['a', 'a', 'b', 'b', 'b'], 'weight': [2, 3, 1, 4, 3], 
                  'qty': [100, 500, 200, 800, 700]})
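
For reference, the weighted averages I expect are (2·100 + 3·500)/(2+3) = 340 for bucket a and (1·200 + 4·800 + 3·700)/(1+4+3) = 687.5 for bucket b.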

My query fails:

import functools
import numpy as np

t.groupby('bucket').agg(
    NR=('bucket', 'count'),
    AVG_QTY=('qty', np.mean),
    W_AVG_QTY=('qty', functools.partial(np.average, weights='weight'))
)

with an error message:

TypeError: 1D weights expected when shapes of a and weights differ.

I assume the problem is that I fixed the weights parameter to another column's name instead of a constant? How can I make this work without falling back on the workaround of apply with a lambda expression that returns a Series?
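
Indeed, np.average knows nothing about column names; the string seems to be coerced into a 0-d array, whose shape can never match the values:

np.asanyarray('weight').shape   # () -- not a 1-D weights array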

Ferenc Bodon
  • IIUC, you need to pass `group['weight']` into `np.average`, not the string. – Quang Hoang Dec 11 '19 at 16:30
  • How exactly? Can you provide the exact syntax? – Ferenc Bodon Dec 11 '19 at 16:45
  • The issue is that `GroupBy.agg` acts on a single column, whereas a weighted average requires 2 separate columns: one for the values, another for the weights. In that case `GroupBy.apply` is required, as you can pass a DataFrame. – ALollz Dec 11 '19 at 16:47

1 Answer


A weighted average requires 2 separate Series (i.e. a DataFrame), so GroupBy.apply is the correct aggregation method to use, as it passes each group's entire sub-frame to the function. Use pd.concat to join its result with the named aggregations.

pd.concat([t.groupby('bucket').agg(NR=('bucket', 'count'),
                                   AVG_QTY=('qty', np.mean)),
           (t.groupby('bucket').apply(lambda gp: np.average(gp.qty, weights=gp.weight))
             .rename('W_AVG_QTY'))],
          axis=1)

#        NR     AVG_QTY  W_AVG_QTY
#bucket                           
#a        2  300.000000      340.0
#b        3  566.666667      687.5

This can be done with agg, assuming your DataFrame has a unique Index, though I can't guarantee it will be very performant given all the slicing. We create our own function that accepts the Series of values and the entire DataFrame. The function then uses the Series' index to slice the weights for each group out of the DataFrame.

def my_w_avg(s, df, wcol):
    # `s` is the group's value Series; its index selects the
    # corresponding rows of `df`, yielding that group's weights
    return np.average(s, weights=df.loc[s.index, wcol])

t.groupby('bucket').agg(
    NR=('bucket', 'count'),
    AVG_QTY=('qty', np.mean),
    W_AVG_QTY=('qty', functools.partial(my_w_avg, df=t, wcol='weight'))
)

#        NR     AVG_QTY  W_AVG_QTY
#bucket                           
#a        2  300.000000      340.0
#b        3  566.666667      687.5
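
Equivalently, you can close over the DataFrame with a lambda instead of functools.partial; this is just a variation of the same idea, with the same unique-index caveat:

t.groupby('bucket').agg(
    W_AVG_QTY=('qty', lambda s: np.average(s, weights=t.loc[s.index, 'weight']))
)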
ALollz
  • Thank you! I agree, this solution well demonstrates that you **can** use `functools.partial` with `GroupBy.agg` to calculate a weighted average, and it also demonstrates that you **should not** do it this way ;) – Ferenc Bodon Dec 11 '19 at 19:08