11

Using pandas, is it possible to compute a single cross-tabulation (or pivot table) containing values calculated from two different functions?

import pandas as pd
import numpy as np

c1 = np.repeat(['a','b'], [50, 50], axis=0)
c2 = list('xy'*50)
c3 = np.repeat(['G1','G2'], [50, 50], axis=0)
np.random.shuffle(c3)
c4=np.repeat([1,2], [50,50],axis=0)
np.random.shuffle(c4)
val = np.random.rand(100)

df = pd.DataFrame({'c1':c1, 'c2':c2, 'c3':c3, 'c4':c4, 'val':val})

frequencyTable = pd.crosstab([df.c1,df.c2],[df.c3,df.c4])
meanVal = pd.crosstab([df.c1,df.c2],[df.c3,df.c4],values=df.val,aggfunc=np.mean)

So, both the rows and the columns are the same in both tables, but what I'd really like is a table with both frequencies and mean values:

c3           G1                       G2          
c4     1              2              1              2
c1 c2  freq val       freq val       freq val       freq val         
a  x   6    0.624931  5    0.582268  8    0.528231  6    0.362804
   y   7    0.493890  8    0.465741  3    0.613126  7    0.312894
b  x   9    0.488255  5    0.804015  6    0.722640  5    0.369480
   y   6    0.462653  4    0.506791  5    0.583695  10   0.517954
Martin Thoma
  • 124,992
  • 159
  • 614
  • 958
HappyPy
  • 9,839
  • 13
  • 46
  • 68

1 Answers1

15

You can give a list of functions:

pd.crosstab([df.c1,df.c2], [df.c3,df.c4], values=df.val, aggfunc=[len, np.mean])

If you want the table as shown in your question, you will have to rearrange the levels a bit:

In [42]: table = pd.crosstab([df.c1,df.c2], [df.c3,df.c4], values=df.val, aggfunc=[len, np.mean])

In [43]: table
Out[43]: 
       len                mean                              
c3      G1     G2           G1                  G2          
c4       1  2   1  2         1         2         1         2
c1 c2                                                       
a  x     4  6   8  7  0.303036  0.414474  0.624900  0.425234
   y     5  5   8  7  0.543363  0.480419  0.583499  0.637657
b  x    10  6   4  5  0.400279  0.436929  0.442924  0.287572
   y     6  8   5  6  0.400427  0.623319  0.764506  0.408708

In [44]: table.reorder_levels([1, 2, 0], axis=1).sort_index(axis=1)
Out[44]: 
c3      G1                            G2                         
c4       1              2              1              2          
       len      mean  len      mean  len      mean  len      mean
c1 c2                                                            
a  x     4  0.303036    6  0.414474    8  0.624900    7  0.425234
   y     5  0.543363    5  0.480419    8  0.583499    7  0.637657
b  x    10  0.400279    6  0.436929    4  0.442924    5  0.287572
   y     6  0.400427    8  0.623319    5  0.764506    6  0.408708
joris
  • 133,120
  • 36
  • 247
  • 202
  • BTW, is it possible to make "len" and "mean" columns of their own when reshaping with melt? I tried: pd.melt(table.reset_index(), id_vars=['c1','c2']), but I get a column named "NaN" with "len" and "mean" as elements. Thanks! – HappyPy Sep 05 '13 at 17:50
  • 1
    What do you exactly mean with "columns of their own"? Do you want one column with "mean" and one column with "len" besided the columns "c1", "c2", "c3", "c4", as you would get with `df.groupby(["c1", "c2", "c3", "c4"]).aggregate([len, np.mean]).reset_index()`? In that case you can do it directly instead of via crosstab and melt. – joris Sep 05 '13 at 19:36