Pandas - Writing a re-usable Python function to aggregate and calculate weighted average

Question

I have a dataframe, for which I could calculate the weighted mean grouped by a given column.

I would do it like this

# Weighted mean by notional
def wn(x):
    y = x.copy().dropna()
    if y.empty:
        return np.nan
    else:
        return np.average(y, weights=trade_basket.loc[y.index, 'NotionalTraded'])


trading_summary_table_avg = trade_basket[columns_to_wn_avg].groupby(trade_basket.Side).agg(wn)

However, I'm trying to make this transformation into a function that would take a dataframe, a list of columns to calculate, the weights, and the column to aggregate by and return the result... because I will later have to calculate again the weighted mean, aggregating by different columns, using a different col as the weight, etc... without having to create a new function as above or a lambda

I'm having a hard time making this into a function because I can't find a way to pass those arguments into my function.

EDIT

I ended up doing the below:

def calc_weighted_avg(basket, weight_col, used_col, agg_col):
    # Weighted mean by weight_col
    def wn(x):
        y = x.copy().dropna()
        if y.empty:
            return np.nan
        else:
            return np.average(y, weights=basket.loc[y.index, weight_col])

    weight_avg_table = basket[used_col].groupby(basket[agg_col]).agg(wn)
    return weight_avg_table

looks like you are using a `np.array` in your function. If you want to use dataframe you should use `pandas.DataFrame`, you can create a dataframe by passing an `np.array` to it, see [this question](https://stackoverflow.com/q/20763012/2536357) — tuned, Jul 19 '21 at 13:50
Not sure what you mean... the above works, but the function is not parameterisable. I am looking for a solution to make it so — iamquitenew, Jul 19 '21 at 14:00
You could use a decorator to pass the column name when binding the function or make them normal parameters and pass them via `functools.partial` to conform to pandas apply signature — maow, Jul 19 '21 at 14:08
@maow wow, thanks for that, quite a bit went over my head though, mind providing an example? — iamquitenew, Jul 19 '21 at 14:31
I wrote out a few examples to what I meant. I hope it helps :) Forget about the decorator pattern though. This is better for use cases where you want to hide boilerplate code that you specify once and don't change afterwards. I thought you might be able to use this, but I was wrong :D — maow, Jul 19 '21 at 18:57

score 0 · Accepted Answer · answered Jul 19 '21 at 18:53

If you are looking for example e.g. https://pbpython.com/weighted-average.html is quite useful.

To make my remark a bit more practical. First some data:

a = np.random.rand(8,2)
df = pd.DataFrame(a, columns=['Dummy_1', 'Dummy_2'])
df['NotionalTraded'] = [1,0,1,0.5,1,0,1,0.5]
df['index'] = ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c']

    Dummy_1   Dummy_2  NotionalTraded index
a  0.793964  0.027435             1.0     a
a  0.719204  0.426740             0.0     a
a  0.821932  0.875693             1.0     a
a  0.007710  0.447273             0.5     a
b  0.138741  0.779036             1.0     b
b  0.936195  0.208430             0.0     b
c  0.718698  0.312217             1.0     c
c  0.508001  0.731941             0.5     c

For something like a normal mean you can do something like

In [74]: df.groupby('index')[['Dummy_1', 'Dummy_2']].agg(np.mean)
Out[74]: 
        Dummy_1   Dummy_2
index                    
a      0.585703  0.444285
b      0.537468  0.493733
c      0.613349  0.522079

The function in agg gets applied to each column for each group in succession. However, if you have a function with multiple parameters, you only get an error.

In [75]: def wn(x, weights):
    ...:     pass
    ...: 

In [76]: df.groupby('index')[['Dummy_1', 'Dummy_2']].agg(wn)
TypeError: wn() missing 1 required positional argument: 'weights'

pandas will forward additional arguments to the aggregation function, but in you case it is really hard. The reason is, that the columns are actually filtered out in the [['Dummy_1', 'Dummy_2']] and agg will iterate over the columns of each group, so that you can't really write functions which use multiple columns.

There is another function called apply which is much better suited for you usecase. For each group it will take the whole DataFrame at once and pass it to the function. The solution in the blog article above translates to

In [79]: df.groupby('index').apply(lambda x: np.average(x[['Dummy_1', 'Dummy_2']], weights=x['NotionalTraded'], axis=0))
Out[79]: 
index
a    [0.6479006083188169, 0.45070588200164324]
b     [0.13874076610128128, 0.779035685092115]
c    [0.6484653712088316, 0.45212507743936053]
dtype: object

or as you mentioned to avoid lambdas

In [80]: def wn(x, weight_col, used_col):
    ...:     return np.average(x[used_col], weights=x[weight_col], axis=0)
    ...: 

In [81]: df.groupby('index').apply(wn, 'NotionalTraded', ['Dummy_1', 'Dummy_2'])
Out[81]: 
index
a    [0.6479006083188169, 0.45070588200164324]
b     [0.13874076610128128, 0.779035685092115]
c    [0.6484653712088316, 0.45212507743936053]
dtype: object

In this case apply forwards the args to the aggregation function. If you don't even want to specify this, you can create a "dummy" function with the args already "filled out". One way is

In [87]: wn_by_notionalTraded = functools.partial(wn, weight_col='NotionalTraded', used_col=['Dummy_1', 'Dummy_2'])

In [88]: df.groupby('index').apply(wn_by_notionalTraded)
Out[88]: 
index
a    [0.6479006083188169, 0.45070588200164324]
b     [0.13874076610128128, 0.779035685092115]
c    [0.6484653712088316, 0.45212507743936053]
dtype: object

This achieves the same effect you would with a factory function

In [90]: def wn_factory(weight_col, used_col):
    ...:     return lambda x: np.average(x[used_col], weights=x[weight_col], axis=0)
    ...: 
    ...: 

In [91]: wn = wn_factory('NotionalTraded', ['Dummy_1', 'Dummy_2'])

In [92]: df.groupby('index').apply(wn)
Out[92]: 
index
a    [0.6479006083188169, 0.45070588200164324]
b     [0.13874076610128128, 0.779035685092115]
c    [0.6484653712088316, 0.45212507743936053]
dtype: object

The thoughts that went into your solution are basically the same. The main challenge is that a weighted average by definition requires the weights and agg assumes you can aggregate each column independently.

Pandas - Writing a re-usable Python function to aggregate and calculate weighted average

1 Answers1