
Given this dataset:

[image: beginning dataset]

... I want to create open, high and low columns, each calculated cumulatively from the start of the current row's minute up to that row. Note that we cannot simply use .resample() in this case. What I'm looking to end up with is a dataset that looks like this:

[image: desired dataset]

I'd like to avoid a for loop here and instead use a vectorized column calculation for the open, high and low columns (unless there's an even faster way to do this, or .resample() could somehow work in this case).

The time column is in pd.to_datetime() format.

I tried to do something like this for the max column:

tick_df['tick_high'] = tick_df[(tick_df['time'] >= tick_df['time'].replace(second=0)) & (tick_df['time'] <= tick_df['time'])].max()

...the logic being: select the rows between the current row's time truncated to the top of the minute (seconds set to 0) and the current row's time. For the first row, for example, that range would be 2022-02-11 19:57:00 to 2022-02-11 19:57:20.

However when I try this, I get the error:

TypeError: replace() got an unexpected keyword argument 'second'

...because I'm actually calling pandas' replace() method, not datetime.replace(). So I also tried adding .dt before the .replace and got this one:

AttributeError: 'DatetimeProperties' object has no attribute 'replace'
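As an aside, I suspect the vectorized way to truncate each timestamp to its minute is the Series.dt.floor accessor rather than replace, though I'm not sure how to fit it into the row-range logic above. A minimal sketch of what I mean:

```python
import pandas as pd

# Sketch: dt.floor("min") truncates every timestamp in the series to
# the top of its minute -- seemingly the vectorized equivalent of
# calling datetime.replace(second=0) on each element.
times = pd.to_datetime(["2022-02-11 19:57:20", "2022-02-11 19:58:09"])
floored = pd.Series(times).dt.floor("min")
print(floored)
```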

Any suggestions on how I can achieve the desired output? For reference, here is my reproducible code:

from datetime import datetime
import pandas as pd

# create a mock tick df
tick_time = ["2022-02-11 19:57:20",
              "2022-02-11 19:57:40",
              "2022-02-11 19:58:01",
              "2022-02-11 19:58:09",
              "2022-02-11 19:58:31",
              "2022-02-11 19:58:45",
              "2022-02-11 19:58:58",
              "2022-02-11 19:59:00",
              "2022-02-11 19:59:20",
              "2022-02-11 19:59:40",
              "2022-02-11 19:59:55"]
tick_time = pd.to_datetime(tick_time)
tick_df = pd.DataFrame(
    {
        "time": tick_time,
        "tick_close": [440.39,440.38,440.39,440.40,440.41,440.42,440.45,440.50,440.52,440.51,440.59],
    },
)
print(tick_df)

# Attempt to resample ticks ohlc from the beginning of each minute
tick_df['tick_high'] = tick_df[(tick_df['time'] >= tick_df['time'].dt.replace(second=0)) & (tick_df['time'] <= tick_df['time'])].max()

I will be back tomorrow to review answers. Thanks!


2 Answers


Based on a GitHub issue, we can do this with map:

tick_df['time'].map(lambda x : x.replace(second=0))

To get your output:

cond1 = tick_df['time'].map(lambda x : x.replace(second=0))
tick_df['tick_high'] = [tick_df.loc[(tick_df['time']>=x) & (tick_df['time']<=y) ,'tick_close'].max() for x, y in zip(cond1,tick_df['time'])]

tick_df
Out[552]: 
                  time  tick_close  tick_high
0  2022-02-11 19:57:20      440.39     440.39
1  2022-02-11 19:57:40      440.38     440.39
2  2022-02-11 19:58:01      440.39     440.39
3  2022-02-11 19:58:09      440.40     440.40
4  2022-02-11 19:58:31      440.41     440.41
5  2022-02-11 19:58:45      440.42     440.42
6  2022-02-11 19:58:58      440.45     440.45
7  2022-02-11 19:59:00      440.50     440.50
8  2022-02-11 19:59:20      440.52     440.52
9  2022-02-11 19:59:40      440.51     440.52
10 2022-02-11 19:59:55      440.59     440.59
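If the list comprehension is too slow on a large frame, a loop-free variant (a sketch, assuming a fixed 1-minute bucketing of the time column) is to group by each row's floored minute and take the running max within each group:

```python
import pandas as pd

tick_time = pd.to_datetime(["2022-02-11 19:57:20", "2022-02-11 19:57:40",
                            "2022-02-11 19:58:01", "2022-02-11 19:58:09",
                            "2022-02-11 19:58:31", "2022-02-11 19:58:45",
                            "2022-02-11 19:58:58", "2022-02-11 19:59:00",
                            "2022-02-11 19:59:20", "2022-02-11 19:59:40",
                            "2022-02-11 19:59:55"])
tick_df = pd.DataFrame({
    "time": tick_time,
    "tick_close": [440.39, 440.38, 440.39, 440.40, 440.41, 440.42,
                   440.45, 440.50, 440.52, 440.51, 440.59],
})

# Group rows by their minute, then take the cumulative max inside each
# group -- same result as the list comprehension, but fully vectorized.
tick_df["tick_high"] = (
    tick_df.groupby(tick_df["time"].dt.floor("min"))["tick_close"].cummax()
)
print(tick_df)
```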
  • Hmmm. Well yes that does work, however I'm not getting the results I need for the high column as they're all just NaN's. – wildcat89 Mar 06 '22 at 04:38
  • @MattWilson check the update – BENY Mar 06 '22 at 04:50
  • That's great, although it uses a for loop which I'm trying to avoid due to the fact that my real dataset has several hundred thousand rows. If it's unavoidable, it is what it is, but wondering if there's another solution sans for loop??? Either way, thanks a lot for the assistance, if no other answers come by tomorrow I'll mark this as answered. – wildcat89 Mar 06 '22 at 04:58

IIUC, is this what you want?

i = pd.Index(['first','cummax','cummin'])
tick_df.join(
    pd.concat([tick_df.groupby(pd.Grouper(key='time', freq='T'))['tick_close']
                      .transform(c)
                      .rename(f'tick_{c}') 
               for c in i], axis=1)
)

Output:

                  time  tick_close  tick_first  tick_cummax  tick_cummin
0  2022-02-11 19:57:20      440.39      440.39       440.39       440.39
1  2022-02-11 19:57:40      440.38      440.39       440.39       440.38
2  2022-02-11 19:58:01      440.39      440.39       440.39       440.39
3  2022-02-11 19:58:09      440.40      440.39       440.40       440.39
4  2022-02-11 19:58:31      440.41      440.39       440.41       440.39
5  2022-02-11 19:58:45      440.42      440.39       440.42       440.39
6  2022-02-11 19:58:58      440.45      440.39       440.45       440.39
7  2022-02-11 19:59:00      440.50      440.50       440.50       440.50
8  2022-02-11 19:59:20      440.52      440.50       440.52       440.50
9  2022-02-11 19:59:40      440.51      440.50       440.52       440.50
10 2022-02-11 19:59:55      440.59      440.50       440.59       440.50
  • Thank you, but no that's just resampling by 1 minute. The desired output is posted in the question. I know it's a tricky one. Essentially, when a new minute happens (example 2022-02-11 19:58:00) the tick_open, tick_high, tick_low and tick_close are all = tick_close. Then, as the minute progresses, those agg functions take effect. Then at the next minute (2022-02-11 19:59:00) they would "reset" again. There should be the same number of rows as the input df. I'm not even sure how to make this work haha but it's needed and I'm stumped! – wildcat89 Mar 06 '22 at 04:53
  • @MattWilson See update. – Scott Boston Mar 06 '22 at 05:16
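Putting the comment's description together with the update above, here is a sketch that maps the same per-minute transforms onto the requested OHLC-style column names (assuming tick_open is the first tick of the minute, tick_high the running max, and tick_low the running min, each resetting at the top of the minute):

```python
import pandas as pd

tick_time = pd.to_datetime(["2022-02-11 19:57:20", "2022-02-11 19:57:40",
                            "2022-02-11 19:58:01", "2022-02-11 19:58:09",
                            "2022-02-11 19:58:31", "2022-02-11 19:58:45",
                            "2022-02-11 19:58:58", "2022-02-11 19:59:00",
                            "2022-02-11 19:59:20", "2022-02-11 19:59:40",
                            "2022-02-11 19:59:55"])
tick_df = pd.DataFrame({
    "time": tick_time,
    "tick_close": [440.39, 440.38, 440.39, 440.40, 440.41, 440.42,
                   440.45, 440.50, 440.52, 440.51, 440.59],
})

# One groupby over the floored minute, reused for all three columns.
g = tick_df.groupby(tick_df["time"].dt.floor("min"))["tick_close"]
tick_df["tick_open"] = g.transform("first")  # close at the minute's first tick
tick_df["tick_high"] = g.cummax()            # running high within the minute
tick_df["tick_low"] = g.cummin()             # running low within the minute
print(tick_df)
```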