This question uses Python 3.7 and pandas 0.23.4.
I'm currently dealing with financial datasets from which I need to retrieve only the data of each trading day between 08:15 and 13:45.
Variable Setup
To illustrate this, I have a DataFrame variable with a DatetimeIndex at a continuous one-minute frequency, declared as follows:
import pandas as pd

y = (
    pd.DataFrame(columns=['x', 'y'])
    .reindex(pd.date_range('20100101', '20100105', freq='1min'))
)
Problem Introduction
I want to slice the data from each day between 08:15 and 13:45. The following code seems to work, but I don't think it's very Pythonic, and it does not seem very memory-efficient given the double indexing at the end:
In [108]: y[y.index.hour.isin(range(8,14))][15:][:-14]
Out[108]:
x y
2010-01-01 08:15:00 NaN NaN
2010-01-01 08:16:00 NaN NaN
2010-01-01 08:17:00 NaN NaN
2010-01-01 08:18:00 NaN NaN
2010-01-01 08:19:00 NaN NaN
... ... ...
2010-01-04 13:41:00 NaN NaN
2010-01-04 13:42:00 NaN NaN
2010-01-04 13:43:00 NaN NaN
2010-01-04 13:44:00 NaN NaN
2010-01-04 13:45:00 NaN NaN
[1411 rows x 2 columns]
EDIT: After checking the data thoroughly, I found that the indexing above does not solve the problem, because the result still contains times after 2010-01-01 13:45:00 and before 2010-01-02 08:15:00:
In [147]: y[y.index.hour.isin(range(8,14))][15:][:-14].index[300:400]
Out[147]:
DatetimeIndex(['2010-01-01 13:15:00', '2010-01-01 13:16:00',
'2010-01-01 13:17:00', '2010-01-01 13:18:00',
'2010-01-01 13:19:00', '2010-01-01 13:20:00',
...
'2010-01-01 13:35:00', '2010-01-01 13:36:00',
'2010-01-01 13:37:00', '2010-01-01 13:38:00',
'2010-01-01 13:39:00', '2010-01-01 13:40:00',
'2010-01-01 13:41:00', '2010-01-01 13:42:00',
'2010-01-01 13:43:00', '2010-01-01 13:44:00',
'2010-01-01 13:45:00', '2010-01-01 13:46:00', # 13:46:00 should be excluded
'2010-01-01 13:47:00', '2010-01-01 13:48:00', # this should be excluded
'2010-01-01 13:49:00', '2010-01-01 13:50:00', # this should be excluded
'2010-01-01 13:51:00', '2010-01-01 13:52:00', # this should be excluded
'2010-01-01 13:53:00', '2010-01-01 13:54:00', # this should be excluded
'2010-01-01 13:55:00', '2010-01-01 13:56:00', # this should be excluded
'2010-01-01 13:57:00', '2010-01-01 13:58:00', # this should be excluded
'2010-01-01 13:59:00', '2010-01-02 08:00:00', # this should be excluded
'2010-01-02 08:01:00', '2010-01-02 08:02:00', # this should be excluded
'2010-01-02 08:03:00', '2010-01-02 08:04:00', # this should be excluded
'2010-01-02 08:05:00', '2010-01-02 08:06:00', # this should be excluded
'2010-01-02 08:07:00', '2010-01-02 08:08:00', # this should be excluded
'2010-01-02 08:09:00', '2010-01-02 08:10:00', # this should be excluded
'2010-01-02 08:11:00', '2010-01-02 08:12:00', # this should be excluded
'2010-01-02 08:13:00', '2010-01-02 08:14:00', # this should be excluded
'2010-01-02 08:15:00', '2010-01-02 08:16:00',
'2010-01-02 08:17:00', '2010-01-02 08:18:00',
'2010-01-02 08:19:00', '2010-01-02 08:20:00',
...
'2010-01-02 08:47:00', '2010-01-02 08:48:00',
'2010-01-02 08:49:00', '2010-01-02 08:50:00',
'2010-01-02 08:51:00', '2010-01-02 08:52:00',
'2010-01-02 08:53:00', '2010-01-02 08:54:00'],
dtype='datetime64[ns]', freq=None)
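The positional slices [15:] and [:-14] only trim the very first and very last day, so every day boundary in between keeps its extra minutes. A minimal sketch of how this could be checked, assuming an equivalent frame built directly on the index; the names sliced, outside and stray are illustrative only:

```python
import datetime as dt

import pandas as pd

# Equivalent to the setup above: an all-NaN frame on a minutely DatetimeIndex.
idx = pd.date_range('20100101', '20100105', freq='1min')
y = pd.DataFrame(index=idx, columns=['x', 'y'])

# The slicing attempt from the question.
sliced = y[y.index.hour.isin(range(8, 14))][15:][:-14]

# Flag every remaining row whose time of day falls outside 08:15-13:45.
times = sliced.index.time  # object array of datetime.time values
outside = (times < dt.time(8, 15)) | (times > dt.time(13, 45))
stray = sliced[outside]

print(len(sliced), len(stray))
```

Any non-zero count for stray means the slice still leaks rows outside the intended window.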
Workaround Attempt
I tried combining boolean masks, but the following code drops minutes 0 to 14 and 46 to 59 of every hour:
y[(
y.index.hour.isin(range(8,14)) & y.index.minute.isin(range(15, 46))
)]
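The hour and minute masks are evaluated independently, which is why whole chunks of every hour disappear. One possible way around this, sketched here under the assumption that the window never crosses midnight, is to collapse hour and minute into a single "minutes since midnight" value and compare that instead; the variable names are illustrative only:

```python
import pandas as pd

# Equivalent to the setup above: an all-NaN frame on a minutely DatetimeIndex.
idx = pd.date_range('20100101', '20100105', freq='1min')
y = pd.DataFrame(index=idx, columns=['x', 'y'])

# The attempted mask: hour and minute are filtered separately.
hour_minute_mask = (
    y.index.hour.isin(range(8, 14)) & y.index.minute.isin(range(15, 46))
)

# Alternative: compare a single minutes-since-midnight value per row.
minutes = y.index.hour * 60 + y.index.minute
start, end = 8 * 60 + 15, 13 * 60 + 45  # 08:15 and 13:45, both inclusive
window_mask = (minutes >= start) & (minutes <= end)

window = y[window_mask]
print(hour_minute_mask.sum(), len(window))
```

The second mask keeps every minute from 08:15 through 13:45 on each day, whereas the first keeps only minutes 15 to 45 of each hour.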
Question
There must be a better, more efficient way to do this that I am missing (or perhaps pandas already has a function for it). What is the more precise/Pythonic way to slice data with a DatetimeIndex by time of day? For example:
y[(y.index.day("everyday") & y.index.time_between('08:15', '13:45'))]
or even better:
y[y.index("everyday 08:15 to 13:45")]
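For reference, pandas does ship something close to the first form: DataFrame.between_time selects rows by time of day across all dates, and DatetimeIndex.indexer_between_time returns the matching integer positions. A minimal sketch, assuming the default behaviour of including both endpoints is what is wanted:

```python
import pandas as pd

# Equivalent to the setup above: an all-NaN frame on a minutely DatetimeIndex.
idx = pd.date_range('20100101', '20100105', freq='1min')
y = pd.DataFrame(index=idx, columns=['x', 'y'])

# Rows between 08:15 and 13:45 on every day; both endpoints are kept by default.
result = y.between_time('08:15', '13:45')

# The same selection expressed through integer positions on the index.
positions = y.index.indexer_between_time('08:15', '13:45')
same_result = y.iloc[positions]

print(result.index[0], result.index[-1], len(result))
```

Neither call needs the intermediate hour filter or the positional slices, so no extra copies of the frame are made along the way.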