As an aspiring data scientist, I am currently learning to work with time series and have just finished learning window functions. It is clear to me that rolling window functions compute a moving metric, such as the average or sum, over time series data. However, I am struggling to understand the computational logic behind rolling windows whose window argument contains 'D', such as '5D'. Here is an example.
I have the following dataset:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Arsik36/StO/master/yahoo.csv',
parse_dates = ['date'], index_col = 'date')
df.head()
From the output in your environment, you will see that the dataset has the date column as its index and a corresponding 'price' column. The logic is perfectly clear to me when I set window = 5, as below:
df['window_5'] = df['price'].rolling(window = 5).mean()
df
The new column starts with several NaN rows and then holds the mean of the last 5 rows, which is crystal clear. However, when I set the window argument to '5D' (5 calendar days), the new column has no NaN values at the beginning.
df['window_5D'] = df['price'].rolling(window = '5D').mean()
df
Through my own analysis, I found that the value in the first row of the 'window_5D' column is just the first row of 'price', the value in the second row is the mean of the first 2 rows of 'price', and so on. What I don't understand is why the computation is done this way when I specify a window of size '5D'.
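To make the behaviour reproducible without the CSV, here is a minimal sketch on synthetic data. The dates and prices are made up for illustration (eight business days starting Monday, 4 January 2021), so this is not the Yahoo dataset. Counting the observations in each window seems to confirm what I described: the '5D' window starts with a single row and grows, and after a weekend it holds fewer than 5 rows.

```python
import pandas as pd

# Hypothetical business-day prices: Mon 4 Jan 2021 through Wed 13 Jan 2021
idx = pd.date_range("2021-01-04", periods=8, freq="B")
s = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0],
              index=idx, name="price")

demo = pd.DataFrame({
    "price": s,
    "window_5": s.rolling(window=5).mean(),      # row-based window
    "window_5D": s.rolling(window="5D").mean(),  # time-based window
    "n_obs_5D": s.rolling(window="5D").count(),  # rows inside each '5D' window
})
print(demo)
```

In this sketch, 'window_5' has 4 leading NaN rows, while 'window_5D' has none, and 'n_obs_5D' reads 1, 2, 3, 4, 5 for the first week and then 3 for each day of the second week.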
The dataset contains Yahoo stock prices. On weekends, the price stays the same as on Friday. So, in my mind, window = '5D' should produce the same leading NaN values as window = 5, but unlike window = 5, it should also treat the weekend price as equal to Friday's price and take that into account when computing the mean.
The concept behind window = '5D' is what confuses me. Thank you in advance for helping me understand the logic of this computation, given the scenario above.
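To make that expectation concrete, here is a rough sketch of what I imagined '5D' would do, again on made-up business-day prices rather than the real data. It assumes the weekend rows are absent from the index and fills them in with Friday's price via asfreq before applying a plain 5-row window. This is only my mental model, not a claim about what pandas does internally.

```python
import pandas as pd

# Hypothetical business-day prices: Mon 4 Jan 2021 through Wed 13 Jan 2021
idx = pd.date_range("2021-01-04", periods=8, freq="B")
s = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0],
              index=idx, name="price")

# Expand to every calendar day; Sat/Sun rows carry Friday's price forward
daily = s.asfreq("D", method="ffill")

# A 5-row window on daily data then spans exactly 5 calendar days
expected = daily.rolling(window=5).mean()
print(expected)
```

With this approach the first 4 rows are NaN, and the weekend rows contribute Friday's price to the mean, which is the result I expected from window = '5D'.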