I have been working on time-series data resampling using Pandas. It works well and gives the required results. However, the performance is a little too slow for my current requirement.
Problem: I have minute data that I need to resample to many frequencies such as 5min, 15min, 3H. Pandas resampling works fine when the dataset is small, but when I resample a large number of records (10 days of data for 1000 symbols), its performance decreases significantly.
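The per-frequency call itself is just the usual groupby/resample/agg pattern; a minimal sketch of what I run for each target frequency (column names match the full code below, and resample_to is only an illustrative helper name):
import pandas as pd

agg_props = {'open': 'first', 'high': 'max', 'low': 'min', 'close': 'last', 'volume': 'sum'}

def resample_to(df: pd.DataFrame, freq: str) -> pd.DataFrame:
    # df: minute candles with 'sid' and 'timestamp' columns (as built in the code below)
    return (df.set_index('timestamp')
              .groupby('sid')
              .resample(freq, label='left', closed='left')
              .agg(agg_props)
              .reset_index())

# one pass per target frequency, e.g.:
# resampled = {freq: resample_to(data_df, freq) for freq in ['5min', '15min', '3H']}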
Tried:
- Implementing the resampling on numpy arrays directly, but it is even slower (see below), probably due to how I implemented it.
- Using apply with the resample method in pandas and a custom cython function (new for me, btw) for the aggregation. Performance was worse than plain resample with agg.
- Similarly to #2, experimenting with numba.jit, but no improvement.
- Splitting the data and using multiprocessing for the resampling (a sketch of what I mean follows this list). It improves performance by ~50%, but the overhead and compute requirements increase significantly.
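Roughly, the multiprocessing variant splits the frame by symbol and runs the same groupby/resample/agg on each chunk in a worker pool; a minimal sketch (function names, the chunking by symbol id, and the worker count are illustrative, not my exact code):
from multiprocessing import Pool

import numpy as np
import pandas as pd

AGG_PROPS = {'open': 'first', 'high': 'max', 'low': 'min', 'close': 'last', 'volume': 'sum'}

def resample_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    # same groupby/resample/agg as the pandas version below, applied to one chunk
    chunk = chunk.set_index('timestamp')
    return (chunk.groupby('sid')
                 .resample('5min', label='left', closed='left')
                 .agg(AGG_PROPS)
                 .reset_index())

def parallel_resample(df: pd.DataFrame, workers: int = 4) -> pd.DataFrame:
    # split by symbol id so every worker receives complete symbols
    sids = df['sid'].unique()
    chunks = [df[df['sid'].isin(part)] for part in np.array_split(sids, workers)]
    with Pool(workers) as pool:
        parts = pool.map(resample_chunk, chunks)
    return pd.concat(parts, ignore_index=True)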
Here is the sample code I used for checking:
Observations:
- When I remove the sum aggregation from agg in pandas, performance improves a little (~15%); the variant is sketched right after this list.
- Numpy is way slower than expected.
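For reference, that "remove sum" check is just the same agg call with the volume aggregation dropped; roughly (agg_props, sample and the timestamp-indexed df as in pandas_resample below):
# timing variant only: drop the 'sum' aggregation and keep the rest
agg_no_sum = {col: fn for col, fn in agg_props.items() if fn != 'sum'}
re_df = df.groupby('sid').resample(sample, label='left', closed='left').agg(agg_no_sum).reset_index()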
from datetime import datetime
from time import time
import numpy as np
import pandas as pd
symbols = 1000
start = datetime(2023, 1, 1)
end = datetime(2023, 1, 2)
data_cols = ['open', 'high', 'low', 'close', 'volume']
agg_props = {'open': 'first', 'high': 'max', 'low': 'min', 'close': 'last', 'volume': 'sum'}
base, sample = '1min', '5min'
def pandas_resample(df: pd.DataFrame):
    df.sort_values(['sid', 'timestamp'], inplace=True)
    df = df.set_index('timestamp')
    re_df = df.groupby('sid').resample(sample, label='left', closed='left').agg(agg_props).reset_index()
    return re_df
def numpy_resample(arr):
    # build the interval boundaries covering the full timestamp range
    intervals = pd.date_range(arr[:, 1].min(), arr[:, 1].max(), freq=sample)
    intervals = list(zip(intervals[:-1], intervals[1:]))
    data = []
    groups = np.unique(arr[:, 0])
    for _group in groups:
        # rows belonging to one symbol (column 0 = sid, column 1 = timestamp)
        group_data = arr[arr[:, 0] == _group, :]
        for _start, _end in intervals:
            _data_filter = (_start <= group_data[:, 1]) & (group_data[:, 1] < _end)
            _interval_data = group_data[_data_filter]
            _interval_agg = [_group, _start]
            _skip = len(_interval_data) == 0
            # columns 2..6 hold open, high, low, close, volume
            for _val, _key in [['open', 2], ['high', 3], ['low', 4], ['close', 5], ['volume', 6]]:
                _col_data = _interval_data[:, _key]
                if not _skip:
                    if _val in ['open']: _key_val = _col_data[0]
                    if _val in ['high']: _key_val = _col_data.max()
                    if _val in ['low']: _key_val = _col_data.min()
                    if _val in ['close']: _key_val = _col_data[-1]
                    if _val in ['volume']: _key_val = _col_data.sum()
                else:
                    _key_val = None
                _interval_agg.append(_key_val)
            data.append(_interval_agg)
    return data
if __name__ == '__main__':
    timestamps = pd.date_range(start, end, freq=base)
    candles = pd.DataFrame({'timestamp': pd.DatetimeIndex(timestamps),
                            **{_col: np.random.randint(50, 150, len(timestamps)) for _col in data_cols}})
    symbol_id = pd.DataFrame({'sid': np.random.randint(1000, 2000, symbols)})
    # cross join: give every symbol the same set of minute candles
    candles['id'] = 1
    symbol_id['id'] = 1
    data_df = symbol_id.merge(candles, on='id').drop(columns=['id'])
    print(len(data_df), "\n", data_df.head(3))
    st1 = time()
    resampled_df = pandas_resample(data_df.copy())
    print('pandas', time() - st1)
    st2 = time()
    numpy_resample(data_df.values)
    print('numpy', time() - st2)
Output:
pandas 3.5319528579711914
numpy 93.10612797737122
Please suggest any other approach or implementation that could give better performance.
Thanks in advance!