What is the fastest way to repeatedly resample timeseries data of the same shape?
Problem: I have 30 years of an hourly timeseries which I want to resample to calendar years (resample rule 'AS'). I need both the mean and the sum in each year. There are no missing hours. I then need to do this over 10,000 times. For the script I am writing, this resampling step is by far the slowest part and the limiting factor in optimising the run time. Because of leap years, one cannot resample in consistent blocks of 8760 hours, as every fourth year has 8784 hours.
Example code:
import pandas as pd
import numpy as np
import time
hourly_timeseries = pd.DataFrame(
    index=pd.date_range(
        pd.Timestamp(2020, 1, 1, 0, 0),
        pd.Timestamp(2050, 12, 31, 23, 30),
        freq="60min")
)
hourly_timeseries['value'] = np.random.rand(len(hourly_timeseries))
# Constraints imposed by wider problem:
# 1. each hourly_timeseries is unique
# 2. each hourly_timeseries is the same shape and has the same datetimeindex
# 3. a maximum of 10 timeseries can be grouped as columns in dataframe
start_time = time.perf_counter()
for num in range(100): # setting as 100 so it runs faster, this is 10,000+ in practice
yearly_timeseries_mean = hourly_timeseries.resample('AS').mean() # resample by calendar year
yearly_timeseries_sum = hourly_timeseries.resample('AS').sum()
finish_time = time.perf_counter()
print(f"Ran in {finish_time - start_time:0.4f} seconds")
>>> Ran in 3.0516 seconds
Solutions I have explored:
- I have made some speed improvements by aggregating multiple timeseries as columns of one dataframe and resampling them together; however, due to restrictions in the set-up of the wider problem I am solving, I am limited to 10 timeseries per dataframe. So the question still stands: is there a way to dramatically speed up resampling of timeseries data if you know the shape of the array will always be the same?
- I also looked into numba, but it does not speed up pandas functions.
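One modest saving I found within pandas itself, assuming the two statistics can share a single pass: group by the year labels once and aggregate both at the same time, which avoids a second trip through the grouping machinery (the variable names below are illustrative, not from my actual script):

```python
import numpy as np
import pandas as pd

# One year of hourly data, mirroring the set-up above
hourly = pd.DataFrame(
    {"value": np.random.rand(8760)},
    index=pd.date_range("2021-01-01", periods=8760, freq="60min"),
)

# One pass over the data yields both statistics, instead of two
# separate resample calls; grouping by the year integer also skips
# the datetime resampling machinery.
both = hourly.groupby(hourly.index.year)["value"].agg(["mean", "sum"])
yearly_mean = both["mean"]
yearly_sum = both["sum"]
```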
Possible solutions which sound reasonable but I cannot find after researching:
- resample 3D array of timeseries data with numpy
- Cache the index that is being resampled and then somehow do every resample after the first resample much faster
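One sketch of the cached-index idea, assuming the shared DatetimeIndex really never changes: compute the positions where each calendar year starts once, then reduce every new value array with pure NumPy (`np.add.reduceat`), skipping pandas entirely inside the loop. This is untested against my real data and the names are illustrative:

```python
import numpy as np
import pandas as pd

# The shared hourly index, built once
idx = pd.date_range("2020-01-01", "2050-12-31 23:00", freq="60min")

# --- One-off, cached preparation ---
years = idx.year.to_numpy()
starts = np.flatnonzero(np.diff(years)) + 1      # first hour of each new year
starts = np.insert(starts, 0, 0)                 # prepend the very first hour
counts = np.diff(np.append(starts, len(idx)))    # hours per year: 8760 or 8784

# --- Per-timeseries work: pure NumPy, no datetime logic ---
values = np.random.rand(len(idx))                # stand-in for one timeseries
yearly_sum = np.add.reduceat(values, starts)     # segment sums per year
yearly_mean = yearly_sum / counts                # segment means per year
```

The cost of locating the year boundaries is paid once; each of the 10,000+ iterations is then just one `reduceat` and one division over contiguous arrays.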
Thanks for your help :)