3

I would like to use a third column to weight results in a pandas crosstab.

For example, the following:

import pandas as pd
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'bar'],
                   'B': [1, 1, 0, 0, 0],
                   'weight': [2, 3, 4, 5, 6]})
print(pd.crosstab(df.A, df.B))

results in:

B    0  1
A        
bar  2  1
foo  1  1

What I would like as a result is:

B     0  1
A        
bar  11  3
foo   4  2
JohnE
  • 29,156
  • 8
  • 79
  • 109
prooffreader
  • 2,333
  • 4
  • 21
  • 32

2 Answers2

10

You can supply a custom aggregate function to a crosstab using the aggfunc parameter:

pd.crosstab(df.A, df.B, df.weight, aggfunc = sum)
B     0  1
A         
bar  11  3
foo   4  2
maxymoo
  • 35,286
  • 11
  • 92
  • 119
1

This is really wasteful of memory and only works if weights can be interpreted as frequencies (i.e. weights are integers), but it's fairly simple to do:

df2 = df.iloc[ np.repeat( df.index.values, df.weight ) ]

That's just using advanced/fancy indexing to expand the rows in proportion to the weights:

     A  B  weight
0  foo  1       2
0  foo  1       2
1  bar  1       3
1  bar  1       3
1  bar  1       3

Then you can run the crosstab normally:

pd.crosstab(df2.A, df2.B)

B     0  1
A         
bar  11  3
foo   4  2

I suspect it's necessary to write a custom version of crosstab in order to handle weights properly and efficiently as there are very few (if any?) functions in pandas that do weights for you automatically. It wouldn't be all that hard though and maybe someone else will do it as an answer.

Possibly scipy or statsmodels has an automatic way to do this?

JohnE
  • 29,156
  • 8
  • 79
  • 109