0

I have a Pandas Dataframe with a DateTime index. It has closing prices of some stocks sampled at the 1-minute interval. I want to resample this dataframe and get it at the 5-minute interval, as if it had been collected in that way. For example:

                         SPY     AAPL
DateTime        
2014-01-02 09:30:00     183.91  555.890
2014-01-02 09:31:00     183.89  556.060
2014-01-02 09:32:00     183.90  556.180
2014-01-02 09:33:00     184.00  556.550
2014-01-02 09:34:00     183.98  556.325
2014-01-02 09:35:00     183.89  554.620
2014-01-02 09:36:00     183.83  554.210

I need to get something like

                         SPY     AAPL
DateTime        
2014-01-02 09:30:00     183.91  555.890
2014-01-02 09:35:00     183.89  554.620

The natural method would be resample() or asfreq() with Pandas. They indeed produce what I need, however with some undesired output as well. My sample has no observations from 4pm of a given weekday until 9:30am of the following day because trading halts during these hours. These mentioned methods end up completing the dataframe with NaN during these periods when there is actually no data to resample from. Is there any option I can use to avoid this behavior? From 4:05pm until 9:25am of the following day I get lots of NaN and just that!

My quick and dirty solution was the following:

Prices_5min = Prices[np.remainder(Prices.index.minute, 5) == 0]

Although I believe that this a quick and elegant solution, I'd would assume that resample() has some option to perform this task. Any ideas? Thanks a lot!


EDIT: Following the comment regarding the undesired output, I add the following code to showcase the problem:

New_Prices = Prices.asfreq('5min')
New_Prices.loc['2014-01-02 15:50:00':'2014-01-03 9:05:00']
Out:
                         SPY    AAPL
DateTime        
2014-01-02 15:50:00     183.12  552.83
2014-01-02 15:55:00     183.08  552.89
2014-01-02 16:00:00     182.92  553.18
2014-01-02 16:05:00     NaN     NaN
2014-01-02 16:10:00     NaN     NaN
...     ...     ...
2014-01-03 08:45:00     NaN     NaN
2014-01-03 08:50:00     NaN     NaN
2014-01-03 08:55:00     NaN     NaN
2014-01-03 09:00:00     NaN     NaN
2014-01-03 09:05:00     NaN     NaN

All these NaN should be part of the final result. They are only there because there were no trading hours. I want to avoid that.

Raul Guarini Riva
  • 651
  • 1
  • 10
  • 20
  • 1
    `df.resample('5Min').first()` and `df.asfreq('5min')` works as you say for the sample dataframe, wil be good if you create a sample for the rest of your problem statement with the expected output – anky Jul 27 '20 at 16:31
  • @anky just added a quick example showcasing the problem! – Raul Guarini Riva Jul 27 '20 at 16:47

3 Answers3

0

You could simply discard the rows containing NaN values with dropna().

Demo with a slightly modified version of your input data:

                        SPY     AAPL
DateTime                            
2014-01-02 09:30:00  183.91  555.890
2014-01-02 09:31:00  183.89  556.060
2014-01-02 09:32:00  183.90  556.180
2014-01-02 09:33:00  184.00  556.550
2014-01-02 09:34:00  183.98  556.325
2014-01-02 09:45:00  183.89  554.620
2014-01-02 09:46:00  183.83  554.210

Straight resampling gives rows with NaN values:

df.asfreq('5min')

                        SPY    AAPL
DateTime                           
2014-01-02 09:30:00  183.91  555.89
2014-01-02 09:35:00     NaN     NaN
2014-01-02 09:40:00     NaN     NaN
2014-01-02 09:45:00  183.89  554.62

which go avay with dropna():

df.asfreq('5min').dropna()

                        SPY    AAPL
DateTime                           
2014-01-02 09:30:00  183.91  555.89
2014-01-02 09:45:00  183.89  554.62
Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252
0

Overview: Create an interval index to describe trading times (0930 to 1400 on business days). Then find the time stamps (from resample) that are in the trading window.

import pandas as pd

bdate_range = pd.bdate_range(start='2014-01-02', periods=5)
bdate_range

trading_windows = [
    (d + pd.Timedelta('9.5h'), d + pd.Timedelta('16h'))
    for d in bdate_range
]
trading_windows

trading_windows = pd.IntervalIndex.from_tuples(trading_windows)

for t in trading_windows: print(t)

(2014-01-02 09:30:00, 2014-01-02 16:00:00]
(2014-01-03 09:30:00, 2014-01-03 16:00:00]
(2014-01-06 09:30:00, 2014-01-06 16:00:00]
(2014-01-07 09:30:00, 2014-01-07 16:00:00]
(2014-01-08 09:30:00, 2014-01-08 16:00:00]

...and created a list of the 5-minute intervals from your example (some during trading hours, other time stamps when trading is halted)

stamps = [
    '2014-01-02 15:50:00',
    '2014-01-02 15:55:00',
    '2014-01-02 16:00:00',
    '2014-01-02 16:05:00',
    '2014-01-02 16:10:00',
]
stamps = pd.to_datetime(stamps)

Then, I used the .contains() method of Interval Index to determine whether a timestamp (from resample) is during the trading window:

mask = [trading_windows.contains(stamp).any() for stamp in stamps]
stamps[mask]


[3]:
DatetimeIndex(['2014-01-02 15:50:00', '2014-01-02 15:55:00',
               '2014-01-02 16:00:00'],
              dtype='datetime64[ns]', freq=None)

This keeps all time stamps during the trading window (whether there are actual trades or not). And you can include holidays in the creation of 'trading_windows'.

jsmart
  • 2,921
  • 1
  • 6
  • 13
0

Possibly resampling at 5 min freq along with 'last' statistic must work in ur case U can specify the labels as the right and include the right end in the resampling

Finally, u can apply ffill in the to avoid time leakage

  • 1
    As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jan 01 '22 at 00:39