
I have hourly FX data which I need to resample into '1D' or '24hr' bins that close at 12:00 Monday through Thursday and at 21:00 on Friday, totaling 5 bins per week:

Date                 rate
2020-01-02 00:00:00  0.673355
2020-01-02 01:00:00  0.673110
2020-01-02 02:00:00  0.672925
2020-01-02 03:00:00  0.672240
2020-01-02 04:00:00  0.671980
2020-01-02 05:00:00  0.672230
2020-01-02 06:00:00  0.671895
2020-01-02 07:00:00  0.672175
2020-01-02 08:00:00  0.672085
2020-01-02 09:00:00  0.670870
2020-01-02 10:00:00  0.670580
2020-01-02 11:00:00  0.668840
2020-01-02 12:00:00  0.669460
2020-01-02 13:00:00  0.670160
2020-01-02 14:00:00  0.670560
2020-01-02 15:00:00  0.671240
2020-01-02 16:00:00  0.669170
2020-01-02 17:00:00  0.668830
2020-01-02 18:00:00  0.668920
2020-01-02 19:00:00  0.669345
2020-01-02 20:00:00  0.669590
2020-01-02 21:00:00  0.670175
2020-01-02 22:00:00  0.669630
2020-01-02 23:00:00  0.669835
2020-01-03 00:00:00  0.669570
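
For reference, a minimal sketch of how a sample like the one above can be loaded into an hourly-indexed frame (the io.StringIO source is just an assumption for the example; the real data comes from a file or feed, and only a few of the rows above are repeated here):

import io
import pandas as pd

# a few rows from the table above, comma-separated for read_csv
sample = """Date,rate
2020-01-02 00:00:00,0.673355
2020-01-02 11:00:00,0.668840
2020-01-02 12:00:00,0.669460
2020-01-02 13:00:00,0.670160
2020-01-02 23:00:00,0.669835
2020-01-03 00:00:00,0.669570
"""

# hourly rates indexed by timestamp, matching the structure of the data above
df = pd.read_csv(io.StringIO(sample), parse_dates=["Date"], index_col="Date")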

So the number of hours per bin is uneven: "Monday" = 00:00:00 Monday through 12:00:00 Monday, "Tuesday" (and likewise Wednesday and Thursday) = 13:00:00 of the previous day through 12:00:00 of that day, and "Friday" = 13:00:00 through 21:00:00.

In trying to find a solution I see that base is now deprecated, and the offset/origin parameters aren't working as expected, likely due to the uneven number of rows per day:

df.rate.resample('24h', offset='12h').ohlc()

I've spent hours trying to find a solution.

How can I simply bin all the data rows between successive 12:00:00 timestamps into ohlc() columns?

The desired output would look something like this:

                                   open      high       low     close
2020-01-02 00:00:00.0000000  0.673355  0.673355  0.673355  0.673355
2020-01-03 00:00:00.0000000  0.673110  0.673110  0.668830  0.669570
2020-01-04 00:00:00.0000000  0.668280  0.668280  0.664950  0.666395
2020-01-05 00:00:00.0000000  0.666425  0.666425  0.666425  0.666425
trock2000
  • Could you show your desired output? – Bill Huang Nov 05 '20 at 19:13
  • @BillHuang added the desired output, which is the Open, High, Low and Close for each day – trock2000 Nov 05 '20 at 19:45
  • I still don't get where the problem is. `df.rate.resample("D").ohlc()` should work for a different number of rows and different timings on each weekday (tested on pandas 1.1.3 and python 3.7.9). So if you don't want the rows outside the opening hours, why not just discard them before resampling? To me, it really looks like your problem has nothing to do with resampling. – Bill Huang Nov 05 '20 at 20:34
  • Sorry my question wasn't clearer – when I use df.rate.resample("D").ohlc(), it takes the last row of the calendar day Monday (23:00:00 in my case) as the closing row for that date. I need the day to close at 12:00:00. Using offset, origin, base, etc. does not work. – trock2000 Nov 05 '20 at 20:43

3 Answers


Is this what you are looking for, using both origin and offset as parameters?

df.resample('24h', origin='start_day', offset='13h').ohlc()

For your example, this gives me:

                         open      high       low     close
datetime
2020-01-01 13:00:00  0.673355  0.673355  0.668840  0.669460
2020-01-02 13:00:00  0.670160  0.671240  0.668830  0.669570
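
If the bins should be labelled by the calendar date on which they close rather than by their 13:00 start (an assumption about the desired labels, not something stated in this answer), a small relabelling step could be appended:

out = df['rate'].resample('24h', origin='start_day', offset='13h').ohlc()
# shift each bin label from its 13:00 start to midnight of the following day,
# i.e. the calendar date on which the bin closes
out.index = out.index.normalize() + pd.Timedelta(days=1)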
Sander van den Oord

Since the period lengths are unequal, IMO it is necessary to craft the mapping yourself. Speaking precisely, the 1.5-day length of the Monday period makes it impossible for freq='D' to do the mapping correctly in one pass.

The hand-crafted code also guards against records that fall outside the well-defined periods.

Data

Slightly different timestamps are used to demonstrate the correctness of the code. The days run from Monday to Friday.

import io
import pandas as pd
from pandas import Timestamp, Timedelta

df = pd.read_csv(io.StringIO("""
                         rate
Date                         
2020-01-06 00:00:00  0.673355
2020-01-06 23:00:00  0.673110
2020-01-07 00:00:00  0.672925
2020-01-07 12:00:00  0.672240
2020-01-07 13:00:00  0.671980
2020-01-07 23:00:00  0.672230
2020-01-08 00:00:00  0.671895
2020-01-08 12:00:00  0.672175
2020-01-08 23:00:00  0.672085
2020-01-09 00:00:00  0.670870
2020-01-09 12:00:00  0.670580
2020-01-09 23:00:00  0.668840
2020-01-10 00:00:00  0.669460
2020-01-10 12:00:00  0.670160
2020-01-10 21:00:00  0.670560
2020-01-10 22:00:00  0.671240
2020-01-10 23:00:00  0.669170
"""), sep=r"\s{2,}", engine="python")

df.set_index(pd.to_datetime(df.index), inplace=True)

Code

def find_day(ts: Timestamp):
    """Find the trading day with irregular length"""

    wd = ts.isoweekday()
    if wd == 1:
        return ts.date()
    elif wd in (2, 3, 4):
        return ts.date() - Timedelta("1D") if ts.hour <= 12 else ts.date()
    elif wd == 5:
        if ts.hour <= 12:
            return ts.date() - Timedelta("1D")
        elif 13 <= ts.hour <= 21:
            return ts.date()

    # out of range or nulls
    return None

# map the timestamps, and set as new index
df.set_index(pd.DatetimeIndex(df.index.map(find_day)), inplace=True)

# drop invalid values and collect ohlc
ans = df["rate"][df.index.notnull()].resample("D").ohlc()

Result

print(ans)

                open      high       low     close
Date                                              
2020-01-06  0.673355  0.673355  0.672240  0.672240
2020-01-07  0.671980  0.672230  0.671895  0.672175
2020-01-08  0.672085  0.672085  0.670580  0.670580
2020-01-09  0.668840  0.670160  0.668840  0.670160
2020-01-10  0.670560  0.670560  0.670560  0.670560
Bill Huang
  • Thanks for sharing this – I arrived at a similar solution that was slightly different, using different day boundaries, i.e. "Monday" = 13:00:00 Friday through 12:00:00 Monday – see my answer below. – trock2000 Nov 06 '20 at 14:31

I ended up using a combination of groupby and day-of-week identification to arrive at my specific solution:

import numpy as np
import pandas as pd

# flag the rows at the rebalance time (12:00:00) ---------------------------
df['idx'] = range(len(df))                       # positional row index
df['day'] = df.index.weekday                     # day of week per row (Mon=0 ... Sun=6)

dtChgIdx = []                                    # will hold the "12:00:00" row positions
justDates = list(dict.fromkeys(df.index.date))   # unique dates, order preserved
grouped_dates = df.groupby(df.index.date)        # group the entire df by calendar date

for d in justDates:
    tempDf = grouped_dates.get_group(d).copy()   # rows for one calendar date
    if tempDf['day'].iloc[0] == 6:
        continue                                 # skip Sundays
    tempDf['time'] = tempDf.index.strftime('%H:%M:%S')                 # time portion of index
    tempDf['dayCls'] = np.where(tempDf['time'] == '12:00:00', 1, 0)    # flag the 12:00:00 row
    dtChgIdx.append(tempDf.loc[tempDf['dayCls'] == 1, 'idx'].iloc[0])  # store its position
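
The snippet above stops once the 12:00:00 row positions are collected. One way the OHLC rows might then be built from those boundaries is sketched below; this continuation is not part of the original answer, so treat it as an assumption about the remaining step (the 21:00 Friday close would still need separate handling):

# possible continuation (assumed, not from the original answer): use the
# collected 12:00:00 row positions as bin boundaries and build OHLC rows
bars = []
bounds = [0] + [i + 1 for i in dtChgIdx]        # each bin ends at (and includes) a 12:00:00 row
for start, end in zip(bounds[:-1], bounds[1:]):
    chunk = df['rate'].iloc[start:end]
    bars.append({'Date': chunk.index[-1].normalize(),
                 'open': chunk.iloc[0],
                 'high': chunk.max(),
                 'low': chunk.min(),
                 'close': chunk.iloc[-1]})
ohlc = pd.DataFrame(bars).set_index('Date')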
trock2000