TLDR: I want to normalize the values in a series based on a rolling window. I did it partially with .apply(),
but its running time is too slow and I'm looking for a better way (performance-wise).
I have a time series with non-stationary data. I'm trying to eliminate the trend, and I want to do so by changing each value to its percentile over the last period. For example, with a 24-hour window: take the current value and compute what percentile it falls at relative to the last 24 hours; that percentile becomes the new value. I hope this makes sense.
To be more concrete, I'll use the following example:
df
0 2.00
1 3.00
2 2.50
3 3.10
4 4.00
5 3.20
6 3.80
7 3.00
8 4.20
9 4.10
10 3.90
11 4.12
12 4.30
13 4.11
dtype: float64
What I did (window size is 4):
output
0 NaN
1 NaN
2 NaN
3 0.75
4 0.75
5 0.50
6 0.50
7 0.00
8 0.75
9 0.50
10 0.25
11 0.50
12 0.75
13 0.25
dtype: float64
The value of the row with index 3 is 0.75 because in the relevant window ([2, 3, 2.5, 3.1]) there are 3 values smaller than 3.1, so it's the 75th percentile. I know it's not technically 100%, but that is what I was able to achieve. The row with index 9 (value 4.1) changed to 0.5 because in the window [3.8, 3, 4.2, 4.1] there are two values smaller than 4.1.
I did it with apply as follows:
df.rolling(4).apply(lambda x: len(x[x < x.iloc[-1]]) / float(x.shape[0]))
It works, but it's too slow. My real data is a time series sampled hourly, spanning a few years. In my real application I normalize over the last year, i.e. the percentile with a window size of 365*24 values. That is why I didn't mind the 75% vs. 100% issue: with a denominator of 8760 it makes no practical difference. Still, if anyone has a suggestion "by the way" that allows both 1 and 0, that's a great bonus. I would also like to somehow handle the "starting values", i.e. replace the NaN values with 0, 0.5, 0.333 (2 turns to 0; 3 turns to 0.5 since it's larger than 2, one of the two values seen so far; 2.5 turns to 0.333 since it's larger only than 2 out of the three values seen so far).
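One partial mitigation I already found for the speed (assuming pandas >= 1.0 with numba installed, which may not apply everywhere): rolling.apply accepts raw=True and engine='numba', so the function is JIT-compiled and runs on plain ndarrays instead of Series:
import pandas as pd

def frac_below_last(x):
    # share of values in the window strictly below the window's last value;
    # with raw=True, x arrives as a plain numpy array
    return (x < x[-1]).sum() / len(x)

df.rolling(4).apply(frac_below_last, raw=True, engine="numba")
This speeds up the same computation but doesn't change the semantics, so the questions below still stand.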
To sum up my questions:
- I'm looking for a faster implementation of the normalization I explained
- A solution that improves this so it considers the relevant window without the last value, such that for index 9 the window is indices 5, 6, 7, 8 only (which allows both 0.0 and 1.0 values to appear)
- How to handle the NaNs correctly: start with a smaller window that grows to four and only then starts moving (see the sketch after this list for the behavior I mean)
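To pin down the spec, here is a baseline sketch of the behavior I'm after in pure numpy (sliding_window_view needs numpy >= 1.20, and the intermediate boolean matrix costs O(n*m) memory, so this probably won't scale to my real window of 8760; it illustrates the target rather than being a final answer):
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def pct_rank_prev(s, window):
    # fraction of the previous `window` values strictly below each value;
    # the head uses an expanding window (index 0 has no history -> 0.0)
    a = s.to_numpy(dtype=float)
    n = len(a)
    out = np.zeros(n)
    for i in range(1, min(window, n)):            # expanding head
        out[i] = (a[:i] < a[i]).mean()
    if n > window:
        w = sliding_window_view(a, window)[:-1]   # w[j] = a[j:j+window]
        out[window:] = (w < a[window:, None]).mean(axis=1)
    return pd.Series(out, index=s.index)
On the toy series below this reproduces the desired output, including the 0.0 and 1.0 endpoints; the memory cost of the comparison matrix is why I'm still asking.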
My small example to reproduce:
import pandas as pd
df = pd.Series(data=[2, 3, 2.5, 3.1, 4, 3.2, 3.8, 3, 4.2, 4.1, 3.9, 4.12, 4.3, 4.11])
df.rolling(4).apply(lambda x: len(x[x < x.iloc[-1]]) / float(x.shape[0]))
Truly desired output:
0 0.00
1 1.00
2 0.50
3 1.00
4 1.00
5 0.75
6 0.75
7 0.00
8 1.00
9 0.75
10 0.50
11 0.75
12 1.00
13 0.50
dtype: float64
As a side note: I was thinking of a non-pandas way to do this by holding a sorted array of the values in the current window, then repeatedly taking out the oldest value, inserting the new one, and reading off the new value's rank. With a window of size m it takes about 2*log m per step (binary searches for the removal and the insertion), and I need to do so for all n values in my series. I can code this solution and it's probably pretty efficient in terms of asymptotic time, but it doesn't use any vectorization abilities.
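For reference, a minimal sketch of that idea using sortedcontainers.SortedList (a third-party package; I'm assuming a non-vectorized O(n log m) solution is acceptable as a fallback):
import pandas as pd
from sortedcontainers import SortedList  # pip install sortedcontainers

def pct_rank_sorted(s, window):
    # keep the previous `window` values in sorted order; bisect_left(v)
    # counts how many of them are strictly smaller than v
    a = s.to_numpy(dtype=float)
    win = SortedList()
    out = []
    for i, v in enumerate(a):
        out.append(win.bisect_left(v) / len(win) if win else 0.0)
        win.add(v)
        if len(win) > window:
            win.remove(a[i - window])  # drop the value that left the window
    return pd.Series(out, index=s.index)
As a bonus, this version naturally excludes the current value and starts with a growing window, so it also matches the second and third bullets above.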