I have a pandas DataFrame that I have written to an HDF5 file. The data is indexed by Timestamps and looks like this:
In [5]: df
Out[5]:
Codes Price Size
Time
2015-04-27 01:31:08-04:00 T 111.75 23
2015-04-27 01:31:39-04:00 T 111.80 23
2015-04-27 01:31:39-04:00 T 113.00 35
2015-04-27 01:34:14-04:00 T 113.00 85
2015-04-27 01:55:15-04:00 T 113.50 203
... ... ... ...
2015-05-26 11:35:00-04:00 CA 110.55 196
2015-05-26 11:35:00-04:00 CA 110.55 98
2015-05-26 11:35:00-04:00 CA 110.55 738
2015-05-26 11:35:00-04:00 CA 110.55 19
2015-05-26 11:37:01-04:00 110.55 12
What I would like is to create a function that I can pass a pandas DatetimeIndex and it will return a DataFrame with the rows at or right before each Timestamp in the DatetimeIndex.
The problem I'm running into is that concatenated read_hdf queries won't work if I am looking for more than 30 rows -- see [pandas read_hdf with 'where' condition limitation?
What I am doing now is this, but there has to be a better solution:
from pandas import read_hdf, DatetimeIndex
from datetime import timedelta
import pytz
def getRows(file, dataset, index):
if len(index) == 1:
start = index.date[0]
end = (index.date + timedelta(days=1))[0]
else:
start = index.date.min()
end = (index.date.max() + timedelta(days=1))
where = '(index >= "' + str(start) + '") & (index < "' str(end) + '")'
df = read_hdf(file, dataset, where=where)
df = df.groupby(level=0).last().reindex(index, method='pad')
return df