Rolling and cumulative standard deviation in a Python dataframe

Question

Is there a vectorized operation to calculate the cumulative and rolling standard deviation (SD) of a pandas DataFrame?

For example, I want to add a column 'c' which holds the cumulative SD of column 'a', i.e. at index 0 it shows NaN because there is only 1 data point, at index 1 it calculates the SD from 2 data points, and so on.

The same question applies to the rolling SD. Is there an efficient way to calculate it without iterating through df.itertuples()?

import numpy as np
import pandas as pd

def main():
    np.random.seed(123)
    df = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])
    print(df)

if __name__ == '__main__':
    main()

score 15 · Answer 1 · answered Jul 04 '17 at 04:05

For cumulative SD based on column 'a', let's use rolling with a window size equal to the length of the dataframe and min_periods=2:

df['c'] = df['a'].rolling(len(df), min_periods=2).std()

Output:

          a         b         c
0 -1.085631  0.997345       NaN
1  0.282978 -1.506295  0.967753
2 -0.578600  1.651437  0.691916
3 -2.426679 -0.428913  1.133892
4  1.265936 -0.866740  1.395750
5 -0.678886 -0.094709  1.250335
6  1.491390 -0.638902  1.374933
7 -0.443982 -0.434351  1.274843
8  2.205930  2.186786  1.450563
9  1.004054  0.386186  1.403721
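As a sanity check, the cumulative column can be compared against the sample SD (ddof=1, the pandas default) computed directly on each prefix of 'a'. This is only a rough sketch that reuses the seeded data from the question; the names cumulative, reference and matches are made up for illustration, and the loop is there purely for verification, not as a recommended approach.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])

# Cumulative SD via a rolling window as long as the whole frame
cumulative = df['a'].rolling(len(df), min_periods=2).std()

# Reference values: sample SD (ddof=1) of each prefix a[0..i], computed in a plain loop.
# Index 0 is NaN because a single observation has no sample SD.
reference = pd.Series(
    [np.nan] + [df['a'].iloc[: i + 1].std() for i in range(1, len(df))],
    index=df.index,
)

# True when both columns agree, treating NaN in the same position as equal
matches = np.allclose(cumulative, reference, equal_nan=True)
print(matches)
```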

And for rolling SD based on two values at a time:

df['c'] = df['a'].rolling(2).std()

Output:

          a         b         c
0 -1.085631  0.997345       NaN
1  0.282978 -1.506295  0.967753
2 -0.578600  1.651437  0.609228
3 -2.426679 -0.428913  1.306789
4  1.265936 -0.866740  2.611073
5 -0.678886 -0.094709  1.375197
6  1.491390 -0.638902  1.534617
7 -0.443982 -0.434351  1.368514
8  2.205930  2.186786  1.873771
9  1.004054  0.386186  0.849855
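For a wider window, the same call should work with a different window size, and min_periods controls how many observations are needed before a value is produced. A small sketch, assuming the seeded data from the question; the window of 3 and the names strict and relaxed are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])

# Window of 3: the first two rows are NaN because the window is not yet full
strict = df['a'].rolling(3).std()

# Same window, but a value is emitted as soon as 2 observations are available
relaxed = df['a'].rolling(3, min_periods=2).std()

print(pd.DataFrame({'a': df['a'], 'strict': strict, 'relaxed': relaxed}))
```

Once the window is full (index 2 onward here), both variants should give the same values; they differ only in the warm-up rows.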

score 7 · Answer 2 · answered Apr 15 '19 at 23:57

If by rolling you mean cumulative, then I think the right term in pandas is expanding:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.expanding.html#pandas.DataFrame.expanding

It also accepts a min_periods argument.

df['c'] = df['a'].expanding(2).std()

The rolling case was handled by Scott Boston, and it is unsurprisingly called rolling in pandas.

The advantage of expanding over rolling(len(df), ...) is that you don't need to know the length in advance. This is very useful with groupby, for example, where each group has its own length.
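To make the comparison concrete, here is a minimal sketch, assuming the seeded data from the question. It first checks that expanding and the full-length rolling window agree, then computes an expanding SD per group. The 'group' column and its labels 'x' and 'y' are hypothetical and added only for the example.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])

# The two spellings of cumulative SD should give the same result
via_expanding = df['a'].expanding(min_periods=2).std()
via_rolling = df['a'].rolling(len(df), min_periods=2).std()
same = np.allclose(via_expanding, via_rolling, equal_nan=True)
print(same)

# Hypothetical grouping column, added only for illustration
df['group'] = ['x', 'y'] * 5

# Expanding SD of 'a' restarted within each group; no group length is needed
per_group = (
    df.groupby('group')['a']
      .expanding(min_periods=2)
      .std()
      .reset_index(level=0, drop=True)  # drop the group level so the index matches df.index
)

# Assignment aligns on the original row index
df['c'] = per_group
print(df)
```

The first row of each group comes out as NaN, since each group starts its own expanding window.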
