
TLDR: I want to normalize the values in a series based on a rolling window. I partially did it with .apply(), but its running time is too slow and I'm looking for a better way (performance-wise).

I have a time series with non-stationary data. I'm trying to eliminate the trend, and I want to do so by replacing each value with its percentile over the last period. For example, with a 24-hour window: take the current value and compute what percentage of the last 24 hours' values it exceeds; that becomes the new value. I hope this makes sense.

To be more concrete, I'll use the following example:

df

0     2.00
1     3.00
2     2.50
3     3.10
4     4.00
5     3.20
6     3.80
7     3.00
8     4.20
9     4.10
10    3.90
11    4.12
12    4.30
13    4.11
dtype: float64

What I did (window size is 4):

output

0      NaN
1      NaN
2      NaN
3     0.75
4     0.75
5     0.50
6     0.50
7     0.00
8     0.75
9     0.50
10    0.25
11    0.50
12    0.75
13    0.25
dtype: float64

The value of the row with index 3 is 0.75 because in the relevant window ([2, 3, 2.5, 3.1]) there are 3 values smaller than 3.1, so it's the 75th percentile. I know it's technically not 100%, but that is what I was able to achieve. The row with index 9 (value 4.1) changed to 0.5 because in the window [3.8, 3, 4.2, 4.1] there are two values smaller than 4.1.

I did it with apply as follows:

df.rolling(4).apply(lambda x: len(x[x < x.iloc[-1]]) / float(x.shape[0]))

It works, but it's too slow. My real data is a time series sampled hourly, spanning a few years. In my real application I normalize over the last year, meaning it's the percentile with a window size of 365*24 values. That is why I didn't mind the 75% vs. 100% issue: with a denominator of 8760 it makes no difference. Still, if anyone has a suggestion "by the way" to allow both 1 and 0, it's a great bonus. I would also like to somehow do the same for the "starting values", i.e. replace the NaN values with 0, 0.5, 0.333 (2 turns to 0, 3 turns to 0.5 since it's larger than 2, and 2.5 turns to 0.333 since it's larger only than 2).

To sum up my questions:

  1. I'm looking for a faster implementation of the normalization I explained.
  2. An improvement such that the window excludes the current value, so that for index 9 the window is indices 5, 6, 7, 8 only (which allows both 0.0 and 1.0 values to appear).
  3. How to handle the NaNs correctly: start with a smaller window that grows to four, and only then start moving.

My small example to reproduce:

import pandas as pd
df = pd.Series(data=[2, 3, 2.5, 3.1, 4, 3.2, 3.8, 3, 4.2, 4.1, 3.9, 4.12, 4.3, 4.11])
df.rolling(4).apply(lambda x: len(x[x < x.iloc[-1]]) / float(x.shape[0]))
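By the way, for the "starting values" in question 3, pandas' min_periods argument may already do the trick (a minimal sketch, assuming the usual rolling/min_periods semantics, i.e. that apply receives only the values available so far):

df.rolling(4, min_periods=1).apply(lambda x: len(x[x < x.iloc[-1]]) / float(x.shape[0]))

This yields 0.0, 0.5 and 0.333 for the first three rows, matching the start-up values described above, though it still includes the current value in the window (so question 2 remains).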

Truly desired output:

0     0.00
1     1.00
2     0.50
3     1.00
4     1.00
5     0.75
6     0.75
7     0.00
8     1.00
9     0.75
10    0.50
11    0.75
12    1.00
13    0.75
dtype: float64

As a side note: I was thinking of a non-pandas way to do this by holding a sorted array of the values in the current window, then taking out the oldest value, adding the new one, and so on. With a window of size m this takes 2*log(m) per step (two binary searches), and I need to do it for all n values in my series. I can code this solution and it's probably pretty efficient in terms of asymptotic time, but it doesn't use any vectorization abilities.
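A minimal sketch of that idea, using the standard-library bisect module (my rough attempt, not a tuned implementation; note that inserting into and deleting from a plain Python list is actually O(m) because of element shifting, so a true O(log m) step would need something like sortedcontainers.SortedList):

import bisect
import pandas as pd

def rolling_rank(series, window):
    values = series.to_list()
    sorted_win = []  # holds the previous `window` values, kept sorted
    out = []
    for i, v in enumerate(values):
        if sorted_win:
            # fraction of the previous values strictly smaller than v
            out.append(bisect.bisect_left(sorted_win, v) / len(sorted_win))
        else:
            out.append(0.0)  # the first value has no history
        bisect.insort(sorted_win, v)
        if len(sorted_win) > window:
            # evict the value that just dropped out of the window
            sorted_win.pop(bisect.bisect_left(sorted_win, values[i - window]))
    return pd.Series(out, index=series.index)

Since the rank is computed before the current value is inserted, the window excludes the current value (question 2), and the denominator grows from 1 up to the window size at the start (question 3).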

Roim

2 Answers


Slightly faster and at least more readable is:

import numpy as np
df.rolling(4).apply(lambda x: np.mean(x < x.iloc[-1]))

But for an optimal solution, maybe you can use the performance analysis in this post. You might be able to formulate your problem using scipy.convolve :)
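If you want to stay fully vectorized without scipy, one option (my sketch, using numpy's sliding_window_view, available since NumPy 1.20, and assuming the df from the question) is to compare every window against its last element in one shot. Note this materializes an (n - window + 1) x window boolean array, which can be heavy for very large windows:

import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

a = df.to_numpy()
wins = sliding_window_view(a, 4)            # shape (len(a) - 3, 4), a view, no copy
ranks = (wins < wins[:, -1:]).mean(axis=1)  # same value as np.mean(x < x.iloc[-1]) per window
out = pd.Series(np.concatenate([np.full(3, np.nan), ranks]), index=df.index)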

Roelant
  • Thank you, it's indeed a bit faster. I didn't think about improving that part. It is an improvement, but not the full solution I hoped for – Roim Oct 02 '20 at 21:00

Rolling is generally slow. You can just do a for loop:

import numpy as np

window = 4
s = 0
for i in np.arange(1, window + 1):
    # count, for each row, how many of the previous `window` values it exceeds;
    # the NaNs introduced by shift compare as False
    s = s + (df > df.shift(i))

s / window

Output:

0     0.00
1     0.25
2     0.25
3     0.75
4     1.00
5     0.75
6     0.75
7     0.00
8     1.00
9     0.75
10    0.50
11    0.75
12    1.00
13    0.50
dtype: float64
Quang Hoang
  • Thank you for your solution. It's a small improvement and may be better, but I guess using a for loop may still be a bit slow (as I'm still using apply and a for loop). I will remain optimistic and hope there's another solution that improves the running time even more – Roim Oct 02 '20 at 20:58
  • It only loops over the window size, which is usually small unless you have big windows in the thousands; further vectorization won't yield much improvement. – Quang Hoang Oct 02 '20 at 21:13
  • Each window is of size 8760 (the size of a year when data is sampled hourly), so I'm not sure.... – Roim Oct 02 '20 at 21:20