0

I've written below code on Python3 for resampling timeseries data. My requirement is to resample data only between 9:00 AM to 4:00 PM for each day. However I can't do the resample operation separately by grouping the complete data using "group by date" and merge the results thereafter because I want to use ffill() here also.

import pandas as pd
import datetime as dt
import numpy as np

# Generating Sample Data - Start
dateTime = pd.date_range(dt.datetime(2020,7,20,7,0,0), dt.datetime(2020,7,29,17,0,0), freq='1T')
open = np.random.randint(1, 100, len(dateTime))
high = np.random.randint(1, 100, len(dateTime))
low = np.random.randint(1, 100, len(dateTime))
close = np.random.randint(1, 100, len(dateTime))
volume = np.random.randint(1, 100, len(dateTime))
df = pd.DataFrame({'dateTime': dateTime, 'open': open, 'high': high, 'low': low, 'close': close, 'volume': volume})
df = df[df.dateTime.dt.time > dt.time(9,0,0)]
df = df[df.dateTime.dt.time < dt.time(16,0,0)]
# Generating Sample Data - End

# Resample Data - Start
df = df.resample("2T", on='dateTime', base=15).agg({'open':'first', 'high':'max', 'low':'min', 'close':'last', 'volume':'last'}).ffill()
# Resample Data - End

print(df.to_string())

The above mentioned code is working fine but sometime it gives error like "Unable to allocate array with shape......" when data is more. Actually original dataframe contains data only from 9:00 AM to 4:00 PM but when resampling it adds data from 4:01 PM to next day 8:59 AM data also, hence it requires more memory.

So is there any simpler / faster way to do it making it more memory efficient?

BARUN
  • 139
  • 12
  • when in inserts through the night is there already data in dataframe for next day? If so just resample current day by filtering your df to current day only before resampling – Rob Raymond Jul 29 '20 at 12:05
  • Yes. You mean to say group it first by date then resample within each date and merge finally. Right? That is very slower approach. Any faster and memory efficient approach on this? – BARUN Jul 29 '20 at 12:13
  • No - see below. Not use of grouper, just do it in batches – Rob Raymond Jul 29 '20 at 12:22
  • there was an error, updated. Indexes getting bashed :) now down to correct 2100 rows – Rob Raymond Jul 29 '20 at 12:43

1 Answers1

1

Following modification will prevent resample() from inserting values through the night between trading days. Rows is reduced from 6690 to 2100

import pandas as pd
import datetime as dt
import numpy as np

# Generating Sample Data - Start
dateTime = pd.date_range(dt.datetime(2020,7,20,7,0,0), dt.datetime(2020,7,29,17,0,0), freq='1T')
open = np.random.randint(1, 100, len(dateTime))
high = np.random.randint(1, 100, len(dateTime))
low = np.random.randint(1, 100, len(dateTime))
close = np.random.randint(1, 100, len(dateTime))
volume = np.random.randint(1, 100, len(dateTime))
df = pd.DataFrame({'dateTime': dateTime, 'open': open, 'high': high, 'low': low, 'close': close, 'volume': volume})
df = df[df.dateTime.dt.time > dt.time(9,0,0)]
df = df[df.dateTime.dt.time < dt.time(16,0,0)]
# Generating Sample Data - End
# Resample Data - Start
newdf = pd.DataFrame()
for day in df.dateTime.dt.floor("D").unique():
    mask = df[df.dateTime.dt.floor("D")==day].index
    newdf = pd.concat([newdf, df.loc[mask].resample("2T", on="dateTime", base=15).agg({'open':'first', 'high':'max', 'low':'min', 'close':'last', 'volume':'last'}).ffill().reset_index()])
# Resample Data - End
newdf.reset_index(drop=True)
Rob Raymond
  • 29,118
  • 3
  • 14
  • 30
  • This one is nice but the problem is.... For any day, if timeseries data starts from 10:00 AM instead of 9:00 AM, then it will not have resampled data from 9:00 AM for that day. But in original one, it will resample anyway from 9:00 AM by forward filling from yesterday data.. – BARUN Jul 29 '20 at 12:43
  • I mean to say as that was continuous resample so I will get at least data from 9:00 AM after resample if for any day data starts from 10:00 AM. – BARUN Jul 29 '20 at 12:44
  • df = df.append({'dateTime': dt.datetime(2020,7,30,10,0,0), 'open': 13, 'high': 14, 'low': 12, 'close': 16, 'volume': 90}, ignore_index=True) – BARUN Jul 29 '20 at 13:00
  • Add the above line in code and I will not be getting resampled data from 9:00 AM on day 2020-07-30 – BARUN Jul 29 '20 at 13:01
  • you need a range `df = df.append({'dateTime': dt.datetime(2020,7,30,10,0,0), 'open': 13, 'high': 14, 'low': 12, 'close': 16, 'volume': 90}, ignore_index=True)` plus `df = df.append({'dateTime': dt.datetime(2020,7,30,16,0,0), 'open': 13, 'high': 14, 'low': 12, 'close': 16, 'volume': 90}, ignore_index=True)` and it will resample. need at least two rows – Rob Raymond Jul 29 '20 at 14:02
  • Yes but how will that work when data is there for multiple days.. – BARUN Jul 30 '20 at 05:42