Python Pandas rolling sum operation for subset of dataframe

Question

This is best explained through an example.

I have the following dataframe (each row can be thought of as a transaction):

DATE        AMOUNT
2017-01-29  10 
2017-01-30  20
2017-01-31  30
2017-02-01  40
2017-02-02  50
2017-02-03  60

I would like to compute a 2-day rolling sum but only for rows in February.

Code snippet I have currently:

df.set_index('DATE',inplace=True)

res=df.rolling('2d')['AMOUNT'].sum()

which gives:

            AMOUNT 
2017-01-29  10 
2017-01-30  30 
2017-01-31  50 
2017-02-01  70 
2017-02-02  90 
2017-02-03  110

but I really only need the output in the last 3 rows, the operations on the first 3 rows are unnecessary. When the dataframe is huge, this incurs immense time complexity. How do I compute the rolling sum only for the last 3 rows (other than computing the rolling sum for all rows and then doing a row filtering operation after that)?

*I cannot pre-filter the dataframe either because there wouldn't be the 'lookback' period in January for the correct rolling sum value to be obtained.

What is expected output? There are data with multiple years? — jezrael, May 24 '18 at 07:00
@jezrael I'm dealing with a dataset of a few million transactions over 2 years but I only require rolling sum values computed for the second year. The rolling window is 1 year. Thus when I apply the rolling function to all rows, there's 'wasted' computation for a year worth of transactions. — Zi Xiang, May 24 '18 at 07:18

score 0 · Answer 1 · answered May 24 '18 at 07:14

0

You can use timedelta to filter your df and keep the last day of January.

import datetime  

dateStart = datetime.date(2017, 2, 1) - datetime.timedelta(days=1)
dateEnd = datetime.date(2017, 2, 3)
df.loc[dateStart:dateEnd]

Then you can do your rolling operation and drop the first line (which is 2017-01-31)

answered May 24 '18 at 07:14

Charles R

1,621
1
8
25

I did consider this, but the problem is that when my rolling window increases to say 1 year, I'd have to keep an extra year of transactions. Ideally I want to compute the rolling sum _only_ for the February transactions. – Zi Xiang May 24 '18 at 07:29
An other way is to create a column with the value of days -1 : df['AMOUT1'] = df.Amout.shift(1) and then filter the df and do df.AMOUT = df.AMOUT + df.AMOUT1 – Charles R May 24 '18 at 08:07
df['AMOUNT1'] = df.AMOUNT.shift(1) assumes the rolling window is a fixed number of rows. Not sure how the shifting should be implemented with a time-based window (unless I resample the dataframe such that every row corresponds to 1 day). Does get me what I want but the additional operations could be more time eating than saving. – Zi Xiang May 24 '18 at 19:06

gyx-hh · Answer 2 · 2018-05-25T10:10:55.753

0

you can just compute the rolling sum only for the last rows by using tail(4)

res = df.tail(4).rolling('2d')['AMOUNT'].sum()

Output:

DATE
2017-01-31      NaN
2017-02-01     70.0
2017-02-02     90.0
2017-02-03    110.0
Name: AMOUNT, dtype: float64

If you want to merge those values - excluding 2017-01-31 then you can do the following:

df.loc[res.index[1:]] = res.tail(3)

Output:

            AMOUNT
DATE    
2017-01-29  10.0
2017-01-30  20.0
2017-01-31  30.0
2017-02-01  70.0
2017-02-02  90.0
2017-02-03  110.0

edited May 25 '18 at 10:10

answered May 24 '18 at 11:33

gyx-hh

1,421
1
10
15

That gives me 40 for the 2017-02-01 row, which is incorrect (should be 70). – Zi Xiang May 24 '18 at 19:00
Hmm ok you're right.. but I think you just needed to do tail(4)? But I'm on my phone now so I didn't test it. – gyx-hh May 24 '18 at 19:25
@ZiXiang doing `tail(4)` does give the data you wanted. I have also edited the answer to show you how to add `res` to the main df if this is desired. – gyx-hh May 25 '18 at 10:12

Python Pandas rolling sum operation for subset of dataframe

2 Answers2