0

I have a CSV with 1M records. Each record is a unique site/product/date. I am trying to use the .rolling to get a moving average for each site/product across a number of dates. However, the dates are not sorted in chronological order. My question is if I use the .rolling function similar to this:

df.groupby(level='IDs').apply(lambda x: x.rolling(window=2).sum())

...will the rolling average compute in chronological order or list order? I want chronological and trying to avoid having the code spend time to sort 1M records.

cs95
  • 379,657
  • 97
  • 704
  • 746

1 Answers1

0

It really needs to be sorted. This becomes apparent if you give it an offset as the window size (for datetimes) instead of an integer.

Sample Data

import pandas as pd
n = 6
df = pd.DataFrame({'date': pd.date_range('2018-01-01', '2018-01-03', periods=n),
                   'val': range(n)})
df = df.set_index('date')

When in order:

df.rolling(2, on='date').sum()
                 date  val
0 2018-01-01 00:00:00  NaN
1 2018-01-01 09:36:00  1.0
2 2018-01-01 19:12:00  3.0
3 2018-01-02 04:48:00  5.0
4 2018-01-02 14:24:00  7.0
5 2018-01-03 00:00:00  9.0

When unordered it doesn't sort and respect the date ordering.

df.sample(frac=1, random_state=123).rolling(2, on='date').sum()
                 date  val
1 2018-01-01 09:36:00  NaN
3 2018-01-02 04:48:00  4.0
4 2018-01-02 14:24:00  7.0
0 2018-01-01 00:00:00  4.0
2 2018-01-01 19:12:00  2.0
5 2018-01-03 00:00:00  7.0

Now if you want a 2 Day rolling sum, you'd do something like:

df.rolling(window='2D', on='date').sum()
                 date   val
0 2018-01-01 00:00:00   0.0
1 2018-01-01 09:36:00   1.0
2 2018-01-01 19:12:00   3.0
3 2018-01-02 04:48:00   6.0
4 2018-01-02 14:24:00  10.0
5 2018-01-03 00:00:00  15.0

But when not sorted, you get an error indicating that you should have sorted first:

df.sample(frac=1, random_state=123).rolling(window='2D', on='date').sum()
ValueError: date must be monotonic
ALollz
  • 57,915
  • 7
  • 66
  • 89
  • Great, thank you. This is what I will do unless you can see another way to accomplish the same thing with less run time. – Question_Mark Dec 06 '18 at 19:38