Cumulative sum for duplicate rows

Question

So I am having some issues with finding cumulative sums using pandas.

I have a dataframe like this:

df = pd.DataFrame({
    'Date': ['2018-04-01', '2018-04-01', '2018-04-01', '2018-05-01', '2018-05-01','2018-05-01','2018-04-01','2018-05-01'],
    'Category': ['AA', 'AA', 'AA', 'AA', 'AA','AA','AA','AA'],
    'Product': ['a', 'a', 'a', 'a', 'a','a','x','x'],
    'Volumes': [10,30,40,50,60,10,1,2]})

Date         Category       Product     Volumes
2018-04-01     AA               a           10
2018-04-01     AA               a           30
2018-04-01     AA               a           40
2018-05-01     AA               a           50
2018-05-01     AA               a           60
2018-05-01     AA               a           10
2018-04-01     AA               x           1
2018-05-01     AA               x           2

That is, some products are duplicated for the same date and some are unique.

I want to find the cumulative sum in this way:

df = pd.DataFrame({
    'Date': ['2018-04-01', '2018-04-01', '2018-04-01', '2018-05-01', '2018-05-01','2018-05-01','2018-04-01','2018-05-01'],
    'Category': ['AA', 'AA', 'AA', 'AA', 'AA','AA','AA','AA'],
    'Product': ['a', 'a', 'a', 'a', 'a','a','x','x'],
    'Volumes': [80,80,80,190,190,190,1,3]})


Date         Category       Product     Volumes
2018-04-01     AA               a           80
2018-04-01     AA               a           80
2018-04-01     AA               a           80
2018-05-01     AA               a           200
2018-05-01     AA               a           200
2018-05-01     AA               a           200
2018-04-01     AA               x           1
2018-05-01     AA               x           3

Where 80 is the sum of the volumes for April, and 200 is the sum of the volumes for April and May.

I have tried a simple

 df.groupby(['Category', 'Product'])['Volumes'].agg(['sum']).reset_index()```


But that doesnt give me the desired output. 

Any suggestions?

BENY · Accepted Answer · 2019-05-10T14:12:25.453

Update, using cumsum after sum , then merge back

s=df.groupby([df['Date'].dt.month,df['Category'],df['Product']])['Volumes'].sum().reset_index()
s['New']=s.groupby(['Category','Product'])['Volumes'].cumsum()
df=df.assign(Date=df.Date.dt.month,Date1=df.Date).merge(s.drop('Volumes',1),on=['Date','Category','Product'])
df
Out[575]: 
   Date Category Product  Volumes      Date1  New
0     4       AA       a       10 2018-04-01   80
1     4       AA       a       30 2018-04-01   80
2     4       AA       a       40 2018-04-01   80
3     5       AA       a       50 2018-05-01  200
4     5       AA       a       60 2018-05-01  200
5     5       AA       a       10 2018-05-01  200
6     4       AA       x        1 2018-04-01    1
7     5       AA       x        2 2018-05-01    3

score 0 · Answer 2 · answered May 10 '19 at 14:19

0

You can get your desired answer using two separate groupby calls:

df['new'] = df.groupby(['Category','Product'])['Volumes'].transform('cumsum')
df['new'] = df.groupby(['Date','Category','Product'])['new'].transform('max')

answered May 10 '19 at 14:19

T Burgis

1,395
7
9

Cumulative sum for duplicate rows

2 Answers2