I've written below code on Python3 for resampling timeseries data. My requirement is to resample data only between 9:00 AM to 4:00 PM for each day. However I can't do the resample operation separately by grouping the complete data using "group by date" and merge the results thereafter because I want to use ffill() here also.
import pandas as pd
import datetime as dt
import numpy as np
# Generating Sample Data - Start
dateTime = pd.date_range(dt.datetime(2020,7,20,7,0,0), dt.datetime(2020,7,29,17,0,0), freq='1T')
open = np.random.randint(1, 100, len(dateTime))
high = np.random.randint(1, 100, len(dateTime))
low = np.random.randint(1, 100, len(dateTime))
close = np.random.randint(1, 100, len(dateTime))
volume = np.random.randint(1, 100, len(dateTime))
df = pd.DataFrame({'dateTime': dateTime, 'open': open, 'high': high, 'low': low, 'close': close, 'volume': volume})
df = df[df.dateTime.dt.time > dt.time(9,0,0)]
df = df[df.dateTime.dt.time < dt.time(16,0,0)]
# Generating Sample Data - End
# Resample Data - Start
df = df.resample("2T", on='dateTime', base=15).agg({'open':'first', 'high':'max', 'low':'min', 'close':'last', 'volume':'last'}).ffill()
# Resample Data - End
print(df.to_string())
The above mentioned code is working fine but sometime it gives error like "Unable to allocate array with shape......" when data is more. Actually original dataframe contains data only from 9:00 AM to 4:00 PM but when resampling it adds data from 4:01 PM to next day 8:59 AM data also, hence it requires more memory.
So is there any simpler / faster way to do it making it more memory efficient?