Weighting results in pandas crosstab

Question

I would like to use a third column to weight results in a pandas crosstab.

For example, the following:

import pandas as pd
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'bar'],
                   'B': [1, 1, 0, 0, 0],
                   'weight': [2, 3, 4, 5, 6]})
print(pd.crosstab(df.A, df.B))

results in:

B    0  1
A        
bar  2  1
foo  1  1

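What I would like as a result is each cell holding the sum of weight for that (A, B) pair instead of a row count:

B     0  1
A         
bar  11  3
foo   4  2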

How about https://stackoverflow.com/questions/47059124/pandas-crosstab-how-to-calculate-weighted-averages-and-how-to-add-row-and-colu ? — Pythonista anonymous, Nov 01 '17 at 18:23

score 10 · Accepted Answer · answered May 19 '15 at 00:01

You can pass the weight column as the values argument and supply an aggregation function through the aggfunc parameter:

pd.crosstab(df.A, df.B, values=df.weight, aggfunc='sum')
B     0  1
A         
bar  11  3
foo   4  2
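
One caveat that may be worth sketching: when some (A, B) combination never occurs, the weighted crosstab leaves that cell as NaN, not 0. The snippet below is only an illustration under an assumption that is not in the question: a reduced copy of the example frame (the name df_missing is made up here) from which the single foo/1 row has been dropped.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'bar'],
                   'B': [1, 1, 0, 0, 0],
                   'weight': [2, 3, 4, 5, 6]})

# Hypothetical variant: drop the first row so that no (foo, 1) pair remains.
df_missing = df.iloc[1:]

# Sum the weights per cell; the absent (foo, 1) cell comes back as NaN.
raw = pd.crosstab(df_missing.A, df_missing.B,
                  values=df_missing.weight, aggfunc='sum')

# Replace the empty cell with 0 if a zero total is the intended meaning.
filled = raw.fillna(0)
print(filled)
```

Whether 0 is the right fill depends on the data: for summed weights it usually is, for a weighted mean it would not be.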

maxymoo

Excellent, much better than my answer, especially if your dataframe is large. – JohnE May 19 '15 at 04:08

JohnE · Answer 2 · 2015-05-18T23:46:09.713

This is really wasteful of memory and only works if weights can be interpreted as frequencies (i.e. weights are integers), but it's fairly simple to do:

import numpy as np
df2 = df.iloc[np.repeat(np.arange(len(df)), df.weight)]  # repeat row positions, so any index works

That's just using advanced/fancy indexing to repeat each row as many times as its weight. The first few rows of df2 look like this:

     A  B  weight
0  foo  1       2
0  foo  1       2
1  bar  1       3
1  bar  1       3
1  bar  1       3

Then you can run the crosstab normally:

pd.crosstab(df2.A, df2.B)

B     0  1
A         
bar  11  3
foo   4  2
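
As a sanity check, the row-expansion approach can be compared against summing the weight column directly. This is a minimal sketch, assuming the example frame from the question; the names positions, expanded, weighted, and same are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'bar'],
                   'B': [1, 1, 0, 0, 0],
                   'weight': [2, 3, 4, 5, 6]})

# Repeat each row position as many times as its (integer) weight.
positions = np.repeat(np.arange(len(df)), df['weight'].to_numpy())
df2 = df.iloc[positions]

# Plain counts on the expanded frame ...
expanded = pd.crosstab(df2.A, df2.B)
# ... versus summed weights on the original frame.
weighted = pd.crosstab(df.A, df.B, values=df.weight, aggfunc='sum')

same = bool((expanded.to_numpy() == weighted.to_numpy()).all())
print(len(df2), same)
```

The expanded frame has one row per unit of weight, which is where the memory cost comes from.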

I suspect it's necessary to write a custom version of crosstab in order to handle weights properly and efficiently as there are very few (if any?) functions in pandas that do weights for you automatically. It wouldn't be all that hard though and maybe someone else will do it as an answer.

Possibly scipy or statsmodels has an automatic way to do this?
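
For what it's worth, a custom version along those lines could be quite small. The sketch below is an assumption about what such a helper might look like, not an existing pandas function: weighted_crosstab is a made-up name, and the fractional weights are invented to show that no integer restriction applies.

```python
import pandas as pd


def weighted_crosstab(rows, cols, weights):
    """Hypothetical helper: sum `weights` within each (rows, cols) cell."""
    # Group the weights by both keys, then pivot the second key into columns.
    # Combinations that never occur are filled with 0.
    return weights.groupby([rows, cols]).sum().unstack(fill_value=0)


# Invented fractional weights; the repeat-rows trick could not handle these.
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'bar'],
                   'B': [1, 1, 0, 0, 0],
                   'weight': [0.5, 1.5, 2.0, 2.5, 3.0]})

table = weighted_crosstab(df['A'], df['B'], df['weight'])
print(table)
```

Because it aggregates the original rows, memory use stays proportional to the size of df and does not grow with the total weight.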
