22

The rolling window function pandas.DataFrame.rolling takes a window argument that is described as follows:

window : int, or offset

Size of the moving window. This is the number of observations used for calculating the statistic. Each window will be a fixed size.

If its an offset then this will be the time period of each window. Each window will be a variable sized based on the observations included in the time-period. This is only valid for datetimelike indexes. This is new in 0.19.0

What actually is an offset in this context?

cottontail
  • 10,268
  • 18
  • 50
  • 51
ascripter
  • 5,665
  • 12
  • 45
  • 68

2 Answers2

24

In a nutshell, if you use an offset like "2D" (2 days), pandas will use the datetime info in the index (if available), potentially accounting for any missing rows or irregular frequencies. But if you use a simple int like 2, then pandas will treat the index as a simple integer index [0,1,2,...] and ignore any datetime info in the index.

A simple example should make this clear:

df=pd.DataFrame({'x':range(4)}, 
    index=pd.to_datetime(['1-1-2018','1-2-2018','1-4-2018','1-5-2018']))

            x
2018-01-01  0
2018-01-02  1
2018-01-04  2
2018-01-05  3

Note that (1) the index is a datetime, but also (2) it is missing '2018-01-03'. So if you use a plain integer like 2, rolling will just look at the last two rows, regardless of the datetime value (in a sense it's behaving like iloc[i-1:i] where i is the current row):

df.rolling(2).count()

              x
2018-01-01  1.0
2018-01-02  2.0
2018-01-04  2.0
2018-01-05  2.0

Conversely, if you use an offset of 2 days ('2D'), rolling will use the actual datetime values and accounts for any irregularities in the datetime index.

df.rolling('2D').count()

              x
2018-01-01  1.0
2018-01-02  2.0
2018-01-04  1.0
2018-01-05  2.0

Also note, you need the index to be sorted in ascending order when using a date offset, but it doesn't matter when using a simple integer (since you're just ignoring the index anyway).

JohnE
  • 29,156
  • 8
  • 79
  • 109
  • 1
    to give offset in mircroseconds (say10), what string will be used?, I tried '10f', didn't work – Legolas Aug 20 '19 at 09:48
  • 3
    @Legolas this info is surprisingly hard to find altho guessing often works well in this situation ;-) I would use 'us' but 'micro' or 'microseconds' should work too. See here for more: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timedelta.html#pandas.Timedelta – JohnE Aug 20 '19 at 17:09
  • @JohnE That doc doesn't mention lower-case `d` like in the example, only `days`. But only `2d` seems to work, while `2days` doesn't (?). – thakis Jan 06 '20 at 02:30
  • 1
    @thakis Thanks for noting this. I'm not really sure why "days" doesn't work here. In other cases you can specify the offset for days in those other ways (e.g. try `pd.Timedelta('2days')`) but I also can't get "days" to work here. FWIW, I think that 'D' is probably the most standard or canonical way to specify "days" here so I changed the example above to use `2D` instead of `2d` – JohnE Jan 06 '20 at 03:50
  • 2
    @thakis This is a little silly, probably, but if you really wanted to use "days" for some reason, you could do it like this, by explicitly converting to a timedelta: `df.rolling(pd.Timedelta('2days')).count()` – JohnE Jan 06 '20 at 03:53
  • For a list of accepted date offsets see: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects – Marnix.hoh May 14 '20 at 09:17
1

Offsets can be either offset (strings) or pd.Timedelta objects. Internally, both of them get converted into an offset using pd.tseries.frequencies.to_offset() method.

The basic implementation of a rolling window computation with an offset is that for any given index, the offset is subtracted from it creating a slice of the column, i.e. a window, and the function (e.g. max() below) is called on this window. By default, first point in the window edge is excluded from the computation.

df = pd.DataFrame({'A': [1,4,3,2]}, 
                  index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-02', '2020-01-04']))

df['A'].rolling('2D').max()

2020-01-01    1.0
2020-01-02    4.0
2020-01-02    4.0
2020-01-04    2.0
Name: A, dtype: float64

In the example above, the calculations are made in the following windows:

2020-01-01               = max(1)          # `min_periods=1` on time-series
[2020-01-01, 2020-01-02] = max(1, 4)       # only the first value on 2020-01-02 is considered because the second is not seen yet
[2020-01-01, 2020-01-02] = max(1, 4, 3)
[2020-01-03, 2020-01-04] = max(2)          # there is no data on 2020-01-03

All possible offsets are in the pd.offsets module. Among those, only the ones with fixed frequency are valid offsets.1 They are:

  • Day (D)
  • Hour (H)
  • Minute (T)
  • Second (S)
  • Milli (L)
  • Micro (U)
  • Nano (N)

Also the index can be DatetimeIndex, TimedeltaIndex or PeriodIndex. In fact, the rolling window don't even have to be on the index; it can be on a column of datetime, timedelta or period dtype.

So for example, to use a rolling window of 3 microseconds, use 3U.

df = pd.DataFrame({
    'time': pd.date_range('2020-01-01 12:00:00', '2020-01-01 12:00:01', 10**6), 
    'value': 1}).head()
# check that each step is indeed 1 microsecond
df['time'].diff().dropna().dt.microseconds.eq(1).all()   # True

df.rolling('3U', on='time')['value'].sum()

0    1.0
1    2.0
2    3.0
3    3.0
4    3.0
Name: value, dtype: float64

1 The following code returns these offsets.

offsets = {
    name: obj().name for name in dir(pd.offsets) 
    if hasattr((obj:=getattr(pd.offsets, name)), '_nanos_inc')
}
cottontail
  • 10,268
  • 18
  • 50
  • 51