1

Given a data frame, I would like to write a function to calculate the weighted average by group for a given set of columns.

For example, I have a data frame

mydf = pd.DataFrame({
'group' : np.array(['A', 'B', 'C', 'D', 'E']*20),
'weight' : np.array(list(range(1,6))*20),
'x1' : np.random.uniform(100, 200, 100),
'x2' : np.random.uniform(200, 300, 100),
'x3' : np.random.uniform(300, 400, 100),
...
'x999': np.random.uniform(99900, 100000, 100),
})

I could use the following method to calculate the weighted average for each variable

wm = mydf \
.groupby('group') \
.apply(
    lambda x: pd.Series({
        'x1' : np.average(x['x1'], weights=x['weight']),
        'x2' : np.average(x['x2'], weights=x['weight']),
        'x3' : np.average(x['x3'], weights=x['weight']),
        ...
        'x999' : np.average(x['x999'], weights=x['weight']),
            })
) \
.reset_index()

The problem is I need to define a function to compute weighted avg for an arbitrary list of variables ,e.g. ['x11','x12','x16','x77'].

I know I can create a loop to process one variable at a time and then merge the results together, but it would be awfully inefficient. Is there a way to programmatically process a list of variables all at once inside the function?

PingPong
  • 355
  • 2
  • 11

1 Answers1

2

np.average with weights is pretty simple, you can rewrite it as follows:

my_list = ['x1','x2','x3']
(mydf.groupby('group')[my_list+['weight']]
     .apply(lambda x: x[my_list].mul(x['weight'], axis=0).sum() / x['weight'].sum())
)
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74