
What is the fastest way to repeatedly resample timeseries data of the same shape?

Problem: I have 30 years of an hourly timeseries which I want to resample to yearly values by calendar year (resample rule 'AS'). I need to find both the mean and the sum in each year. There are no missing hours. I then need to do this over 10,000 times. For the script I am writing, this resampling step takes by far the most time and is the limiting factor for optimising the run time. Because of leap years, one cannot resample by a consistent 8760 hours, as every fourth year has 8784 hours.

Example code:

import pandas as pd
import numpy as np
import time

hourly_timeseries = pd.DataFrame(
    index=pd.date_range(
        pd.Timestamp(2020, 1, 1, 0, 0),
        pd.Timestamp(2050, 12, 31, 23, 30),
        freq="60min",
    )
)
hourly_timeseries['value'] = np.random.rand(len(hourly_timeseries))
# Constraints imposed by wider problem:
# 1. each hourly_timeseries is unique
# 2. each hourly_timeseries is the same shape and has the same datetimeindex
# 3. a maximum of 10 timeseries can be grouped as columns in dataframe
start_time = time.perf_counter()
for num in range(100): # setting as 100 so it runs faster, this is 10,000+ in practice
    yearly_timeseries_mean = hourly_timeseries.resample('AS').mean() # resample by calendar year
    yearly_timeseries_sum = hourly_timeseries.resample('AS').sum()
finish_time = time.perf_counter()
print(f"Ran in {start_time - finish_time:0.4f} seconds")
>>> Ran in -3.0516 seconds

Solutions I have explored:

  1. I have made some speed improvements by aggregating multiple timeseries as columns of one dataframe and resampling them together (see the sketch after this list); however, due to the restrictions of the set-up of the wider problem I am solving, I am limited to 10 timeseries per dataframe. Therefore, the problem still stands: is there a way to dramatically speed up resampling of timeseries data if you know the shape of the array will always be the same?
  2. I also looked into using numba, but it does not speed up pandas functions.
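
For context, a minimal sketch of the batching from point 1 (the column names and fake data here are illustrative, not from the wider problem):

import pandas as pd
import numpy as np

index = pd.date_range(pd.Timestamp(2020, 1, 1, 0, 0),
                      pd.Timestamp(2050, 12, 31, 23, 30),
                      freq="60min")
# Up to 10 unique series share the same index (wider-problem constraint)
batched = pd.DataFrame(
    {f"ts_{i}": np.random.rand(len(index)) for i in range(10)},
    index=index,
)
yearly_mean = batched.resample('AS').mean()  # one resample covers all 10 columns
yearly_sum = batched.resample('AS').sum()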

Possible solutions which sound reasonable but I cannot find after researching:

  1. resample a 3D array of timeseries data with numpy
  2. Cache the index that is being resampled and then somehow make every resample after the first much faster (a numpy sketch of this idea follows below)
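
A sketch of what I imagine for idea 2, using plain numpy (untested against the wider problem): compute the offsets where each calendar year starts once, then reuse them; np.add.reduceat sums between consecutive offsets, so the 8760/8784-hour years are handled automatically:

years = hourly_timeseries.index.year.to_numpy()
year_starts = np.flatnonzero(np.diff(years, prepend=years[0] - 1))  # computed once
hours_per_year = np.diff(np.append(year_starts, len(years)))        # 8760 or 8784

values = hourly_timeseries['value'].to_numpy()
yearly_sum = np.add.reduceat(values, year_starts)  # sums between consecutive offsets
yearly_mean = yearly_sum / hours_per_year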

Thanks for your help :)

  • Is your data evenly spaced (always half an hour)? Are there some missing hours? If yes and no, then you could use some [signal resampling technique](https://librosa.org/doc/main/generated/librosa.resample.html) – dankal444 Nov 21 '21 at 18:05
  • Have you considered simplifying resampling by calculating average per year? – dankal444 Nov 21 '21 at 18:08
  • Hi @dankal444, please see my edits -- I need to find the sum and the mean. If there is a faster way to repeatedly resample for mean or for sum, it would be great to learn how. And regarding the signal resampling technique, please correct me if I am wrong, but it looks like it can only resample to a target sampling rate, which needs to be a number; it can't be a calendar year, which is sometimes 8760 hours and sometimes 8784 hours – Mike Twomey Nov 21 '21 at 18:41
  • Ok. More questions. All those 10,000 timeseries have exactly the same datetimes? If yes, maybe you could store indices for each year and re-use them for calculating the mean faster. And they are stored in separate dataframes? I think there might be even more questions - the best would be if you provided some simple reproducible example, with some generator of fake data. – dankal444 Nov 21 '21 at 18:51
  • Hi @dankal444, yes all the datetimes are exactly the same. The storing of the indices sounds promising. I would be very interested to know of a way to extract indices of each calendar year and then calculating a mean using those. Thanks also for the suggestions on how to make this question better, this is my first question so big thanks! – Mike Twomey Nov 21 '21 at 18:56
  • No problem, [here is guide how to do such a minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). Let me know if you provide such. – dankal444 Nov 21 '21 at 18:59
  • Thanks, @dankal444. Please see my (hopefully) reprex code :) – Mike Twomey Nov 21 '21 at 19:26
  • Good job, now it is *much* easier for me to help you :) – dankal444 Nov 21 '21 at 21:01

1 Answer


As I wrote in the comments, I prepared indices for each year and used them to calculate the sum for each year much faster.

Next, I removed the unnecessary second pass over the data that computing the mean would do, instead calculating the mean as sum / length_of_indices for each year.

For N=1000 it's ~9x faster:

import pandas as pd
import numpy as np
import time

hourly_timeseries = pd.DataFrame(
    index=pd.date_range(
        pd.Timestamp(2020, 1, 1, 0, 0),
        pd.Timestamp(2050, 12, 31, 23, 30),
        freq="60min",
    )
)
hourly_timeseries['value'] = np.random.rand(len(hourly_timeseries))
# Constraints imposed by wider problem:
# 1. each hourly_timeseries is unique
# 2. each hourly_timeseries is the same shape and has the same datetimeindex
# 3. a maximum of 10 timeseries can be grouped as columns in dataframe
start_time = time.perf_counter()
for num in range(100): # setting as 100 so it runs faster, this is 10,000+ in practice
    yearly_timeseries_mean = hourly_timeseries.resample('AS').mean() # resample by calendar year
    yearly_timeseries_sum = hourly_timeseries.resample('AS').sum()
finish_time = time.perf_counter()
print(f"Ran in {finish_time - start_time:0.4f} seconds")


start_time = time.perf_counter()
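# Precompute once: the row indices belonging to each calendar year,
# reused by every subsequent resample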
events_years = hourly_timeseries.index.year
unique_years = np.sort(np.unique(events_years))
indices_per_year = [np.where(events_years == year)[0] for year in unique_years]
len_indices_per_year = np.array([len(year_indices) for year_indices in indices_per_year])
for num in range(100):  # setting as 100 so it runs faster, this is 10,000+ in practice
    temp = hourly_timeseries.values
    yearly_timeseries_sum2 = np.array([np.sum(temp[year_indices]) for year_indices in indices_per_year])
    yearly_timeseries_mean2 = yearly_timeseries_sum2 / len_indices_per_year

finish_time = time.perf_counter()
print(f"Ran in {finish_time - start_time:0.4f} seconds")
assert np.allclose(yearly_timeseries_sum.values.flatten(), yearly_timeseries_sum2)
assert np.allclose(yearly_timeseries_mean.values.flatten(), yearly_timeseries_mean2)
Ran in 0.9950 seconds
Ran in 0.1386 seconds
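
One caveat if you group up to 10 series as columns (constraint 3): np.sum(temp[year_indices]) collapses all columns into a single number. A per-column variant along the same lines (a sketch, not timed here) would sum along axis 0:

# Per-column variant: temp has shape (n_hours, n_cols)
temp = hourly_timeseries.values
yearly_sums = np.stack(
    [temp[year_indices].sum(axis=0) for year_indices in indices_per_year]
)  # shape (n_years, n_cols)
yearly_means = yearly_sums / len_indices_per_year[:, None]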
dankal444