What I am trying to do
I am calculating certain pivot tables with pandas
, and I want to divide one by the other.
An example (with fictional data) is below, where I have data on the sales of certain items, and I want to calculate things like total sales in $, total # of items sold, unit price, etc.
Another example is to calculate weighted averages, where in the last step you need to divide by the weight.
The data has multi-indices because it is the result of slicing and dicing by various classifications (e.g. how many high-quality vs low-quality widgets you have sold in London vs New York).
A minimum reproducible example
In the example below, I create the dataframe piv_all_sales
with dimensions 7x8:
- 7 rows: 2 regions x 3 products + 1 row for the total
- 8 columns: 2 metrics (gross and net sales) x (3 types of quality (low, medium, high) + 1 column for the total)
piv_all_sales
looks like this:
piv_count
counts how many items I have sold, and has dimensions 7x4:
I want to divide the former by the latter - but the DataFrame.div()
method doesn't work - presumably because the two dataframes have different column names.
An additional complication is that piv_all_sales
has 8 columns while piv_count
has 4
import numpy as np
import pandas as pd
rng = np.random.default_rng(seed=42)
df=pd.DataFrame()
df['region'] = np.repeat(['USA','Canada'],12)
df['product'] = np.tile(['apples','strawberries','bananas'],8)
df['quality'] = np.repeat(['high','medium','low'],8)
df['net sales'] = rng.integers(low=0, high=100, size=24)
df['gross sales'] = rng.integers(low=50, high=150, size=24)
df['# items sold'] = rng.integers(low=1, high=20, size=24)
piv_net_sales = pd.pivot_table(data=df,
values=['net sales'],
index=['region','product'],
columns=['quality'],
aggfunc='sum',
margins=True)
piv_all_sales = pd.pivot_table(data=df,
values=['net sales','gross sales'],
index=['region','product'],
columns=['quality'],
aggfunc='sum',
margins=True)
piv_count = pd.pivot_table(data=df,
values=['# items sold'],
index=['region','product'],
columns=['quality'],
aggfunc='sum',
margins=True)
What I have tried
I wouldn't know how to divide the (7x8) dataframe by the (7x4) one.
So I started by trying to divide a 7x4 by a 7x4, ie using the dataframe which has only the net sales, not the net and gross together. However, neither works:
out1 = piv_net_sales / piv_count
out2 = piv_net_sales.div(piv_count)
presumably because pandas looks for, and doesn't find, columns with the same names?
Neither works because both produce a 7x8 dataframe of all nans
Partial, inefficient solution
The only thing which kind of works is converting each dataframe to a numpy array, and then dividid the two arrays. This way it no longer matters that the column names were different. However:
it is very inelegant and tedious, because I'd have to convert the dataframes to numpy arrays and the recreate the dataframes with the right indices
I still don't know how to divide the 7x8 dataframe by the 7x4; maybe split the 7x8 into 2 (7x4) arrays, calculate each, and then combine them again?
works but not very elegant nor efficient
out3 = piv_net_sales.to_numpy() / piv_count.to_numpy()
Similar questions
I have found some similar questions, e.g.
How to divide two Pandas Pivoted Tables
but I have been unable to adapt those answers to my case.