Pandas Groupby Multiple function application with weighted average

Question

I have a DataFrame that look like this:

In [55]:df
Out [55]:

      WEIGHT  TR_YTD  TR_3Y_Ann  Categ1 Categ2 Number
0   0.131214   -0.28       8.49       1      a    0
1   0.052092    1.69      13.70       3      b    1
2   0.045993    2.50        NaN       1      a    2
3   0.041450   -4.57       5.07       1      c    3
4   0.040769    7.64      17.49       2      a    4
5   0.039791    0.07       0.21       1      a    5
6   0.039271   -6.14       8.88       3      a    6
7   0.038340   -8.13        NaN       1      c    7
8   0.038227    9.26      13.78       2      a    8
9   0.033878    0.02      11.45       1      a    9
10  0.029455    5.91      24.86       3      b   10

I have weights, performances and categories (in reality I have many more columns but this is a minimum working example). I would like to:

group first by each categ alone and then by combinations of the first two categories
aggregate several columns with different functions using a dictionary (as shown in Wes McKinney's book)
calculate the weighted average performances using my user defined function
take into into account NaNs in my data
do some sorting in my output

The code:

def get_wavg(df, field):
    return np.average(df[field], weights=df['WEIGHT'])

groups = [df['Categ1'], df['Categ2'], [df['Categ1'],df['Categ2']]]   

funcdict = {'Number':'count',
            'WEIGHT': 'sum', \
            'TR_YTD': lambda x: get_wavg(x,'TR_YTD'), \
            'TR_3Y_Ann': lambda x: get_wavg(x,'TR_3Y_Ann')}


for group in groups:  

    # preparing list to sort dynamically
    # sorting by the first layers of groups (excluding last) and then weights
    groupnames = [x.name for x in group]
    sortinglist = groupnames[:-1]
    sortinglist.append('WEIGHTS')
    ascendinglist = [True]*(len(groupnames)-1)+[False]

    # apply agg functions
    grouped = df.groupby(group)
    grouped = grouped.agg(funcdict)

    # sorting
    grouped  = pd.DataFrame(grouped)
    grouped.reset_index(inplace=True)
    grouped.sort_values(by=sortinglist,ascending=ascendinglist,inplace=True)
    grouped.set_index(groupnames,inplace=True)

    print (grouped)

Problem is that I can't find the correct syntax to make it work with my user defined function. If I use np.nanmean instead it works but doesn't give me the result I am looking for.

What is the correct way to do this?

Could you clarify a bit what is the result you're after? Why is `np.nanmean` failing you? — Ilja Everilä, Sep 13 '16 at 10:14
Sure. np.nanmean is working, but I am after a different result. np,nanmean will compute the average performance of my group as if they were equally weighted, whereas I want it weighted adjusted (using the weight column) — Mth Clv, Sep 13 '16 at 10:16
np.average with a weight argument gives me the result I am looking for but I don't manage to make it work when using multiple functions aggregation — Mth Clv, Sep 13 '16 at 10:17
also I don't know how to make my get_wavg function ignore the nan values when passing a SeriesGroupby object — Mth Clv, Sep 13 '16 at 10:18
There are some issues with your code not related to calculating the weighted average, which make testing it a bit hard, but could you not just `'TR_3Y_Ann': lambda x: (x * df['WEIGHT']).sum()` etc. (not accounting for the weights not adding up to 1)? — Ilja Everilä, Sep 13 '16 at 10:40
There is a problem though with the formula you are proposing because it will mechanically reduce the average performance, since I would multiply by the weight and not the proportion in the group (weight / total weight in group) — Mth Clv, Sep 13 '16 at 10:49
`get_wavg`:s 1st argument is a series like object, not the full blown `df`. Column access on that does not work. In `groups` the first 2 are series objects instead of a list of series objects. That throws `groupnames = [x.name for x in group]` off. It also seems redundant to even hold the full series in the *groups*, just the names would do. — Ilja Everilä, Sep 13 '16 at 11:01
As for normalizing the weights, `(x * df['WEIGHT'] / df.ix[x.index, 'WEIGHT'].sum()).sum()`, but I admit that is getting mighty ugly. — Ilja Everilä, Sep 13 '16 at 11:06
Can you wrap it up in an answer with the code? I am not sure I understand what you mean. I got your point on the names vs series in the variable groups. you are totally right, thanks. — Mth Clv, Sep 13 '16 at 11:41
There is a problem: df['WEIGHT'] / df.ix[x.index, 'WEIGHT'].sum() will always be equal to one. Therefore (x * df['WEIGHT'] / df.ix[x.index, 'WEIGHT'].sum()).sum() will always result in sum of x, and hence not what I am looking for — Mth Clv, Sep 13 '16 at 12:10
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/123250/discussion-between-ilja-everila-and-mth-clv). — Ilja Everilä, Sep 13 '16 at 12:18

Pandas Groupby Multiple function application with weighted average

0 Answers0