1

I have a DataFrame that look like this:

In [55]:df
Out [55]:

      WEIGHT  TR_YTD  TR_3Y_Ann  Categ1 Categ2 Number
0   0.131214   -0.28       8.49       1      a    0
1   0.052092    1.69      13.70       3      b    1
2   0.045993    2.50        NaN       1      a    2
3   0.041450   -4.57       5.07       1      c    3
4   0.040769    7.64      17.49       2      a    4
5   0.039791    0.07       0.21       1      a    5
6   0.039271   -6.14       8.88       3      a    6
7   0.038340   -8.13        NaN       1      c    7
8   0.038227    9.26      13.78       2      a    8
9   0.033878    0.02      11.45       1      a    9
10  0.029455    5.91      24.86       3      b   10

I have weights, performances and categories (in reality I have many more columns but this is a minimum working example). I would like to:

  • group first by each categ alone and then by combinations of the first two categories
  • aggregate several columns with different functions using a dictionary (as shown in Wes McKinney's book)
  • calculate the weighted average performances using my user defined function
  • take into into account NaNs in my data
  • do some sorting in my output

The code:

def get_wavg(df, field):
    return np.average(df[field], weights=df['WEIGHT'])

groups = [df['Categ1'], df['Categ2'], [df['Categ1'],df['Categ2']]]   

funcdict = {'Number':'count',
            'WEIGHT': 'sum', \
            'TR_YTD': lambda x: get_wavg(x,'TR_YTD'), \
            'TR_3Y_Ann': lambda x: get_wavg(x,'TR_3Y_Ann')}


for group in groups:  

    # preparing list to sort dynamically
    # sorting by the first layers of groups (excluding last) and then weights
    groupnames = [x.name for x in group]
    sortinglist = groupnames[:-1]
    sortinglist.append('WEIGHTS')
    ascendinglist = [True]*(len(groupnames)-1)+[False]

    # apply agg functions
    grouped = df.groupby(group)
    grouped = grouped.agg(funcdict)

    # sorting
    grouped  = pd.DataFrame(grouped)
    grouped.reset_index(inplace=True)
    grouped.sort_values(by=sortinglist,ascending=ascendinglist,inplace=True)
    grouped.set_index(groupnames,inplace=True)

    print (grouped)

Problem is that I can't find the correct syntax to make it work with my user defined function. If I use np.nanmean instead it works but doesn't give me the result I am looking for.

What is the correct way to do this?

Mth Clv
  • 625
  • 1
  • 7
  • 20
  • Could you clarify a bit what is the result you're after? Why is `np.nanmean` failing you? – Ilja Everilä Sep 13 '16 at 10:14
  • Sure. np.nanmean is working, but I am after a different result. np,nanmean will compute the average performance of my group as if they were equally weighted, whereas I want it weighted adjusted (using the weight column) – Mth Clv Sep 13 '16 at 10:16
  • np.average with a weight argument gives me the result I am looking for but I don't manage to make it work when using multiple functions aggregation – Mth Clv Sep 13 '16 at 10:17
  • also I don't know how to make my get_wavg function ignore the nan values when passing a SeriesGroupby object – Mth Clv Sep 13 '16 at 10:18
  • There are some issues with your code not related to calculating the weighted average, which make testing it a bit hard, but could you not just `'TR_3Y_Ann': lambda x: (x * df['WEIGHT']).sum()` etc. (not accounting for the weights not adding up to 1)? – Ilja Everilä Sep 13 '16 at 10:40
  • What are those issues? I am happy to correct. – Mth Clv Sep 13 '16 at 10:47
  • There is a problem though with the formula you are proposing because it will mechanically reduce the average performance, since I would multiply by the weight and not the proportion in the group (weight / total weight in group) – Mth Clv Sep 13 '16 at 10:49
  • `get_wavg`:s 1st argument is a series like object, not the full blown `df`. Column access on that does not work. In `groups` the first 2 are series objects instead of a list of series objects. That throws `groupnames = [x.name for x in group]` off. It also seems redundant to even hold the full series in the *groups*, just the names would do. – Ilja Everilä Sep 13 '16 at 11:01
  • As for normalizing the weights, `(x * df['WEIGHT'] / df.ix[x.index, 'WEIGHT'].sum()).sum()`, but I admit that is getting mighty ugly. – Ilja Everilä Sep 13 '16 at 11:06
  • Can you wrap it up in an answer with the code? I am not sure I understand what you mean. I got your point on the names vs series in the variable groups. you are totally right, thanks. – Mth Clv Sep 13 '16 at 11:41
  • There is a problem: df['WEIGHT'] / df.ix[x.index, 'WEIGHT'].sum() will always be equal to one. Therefore (x * df['WEIGHT'] / df.ix[x.index, 'WEIGHT'].sum()).sum() will always result in sum of x, and hence not what I am looking for – Mth Clv Sep 13 '16 at 12:10
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/123250/discussion-between-ilja-everila-and-mth-clv). – Ilja Everilä Sep 13 '16 at 12:18

0 Answers0