I have 1 minute ohlcv data in a pandas dataframe. I want to resample it with 5 minute intervals and see if the high of the 5 minutes was hit first, or the low, all the while keeping the ohlcv values.
Input is like
datetime | open | high | low | close |
---|---|---|---|---|
2022-01-01 10:00:00 | 10 | 12 | 9 | 11 |
2022-01-01 10:01:00 | 11 | 14 | 9 | 12 |
2022-01-01 10:02:00 | 12 | 12 | 8 | 10 |
2022-01-01 10:03:00 | 10 | 15 | 9 | 11 |
2022-01-01 10:04:00 | 10 | 12 | 8 | 11 |
2022-01-01 10:05:00 | 11 | 11 | 9 | 10 |
Output is like
datetime | open | high | low | close | high_first |
---|---|---|---|---|---|
2022-01-01 10:00:00 | 10 | 15 | 8 | 11 | 0 |
2022-01-01 10:05:00 | 11 | 11 | 9 | 10 | 1 |
First, I tried the simple way:
I would find the indexes of where high would reach its max and low would reach its min, then I would add another column comparing those.
df.resample("15min").agg({"high": ["idxmax", "max"], "low": ["idxmin", "min"]})
But I got an error:
ValueError: attempt to get argmax of an empty sequence
Because my data is not continuous (it cuts off at holidays)
So, left to my own devices, I wrote an apply function:
def transform(x):
x["open_first"] = x["open"][0]
x["close_last"] = x["close"][-1]
x["high_max"] = x["high"].max()
x["low_min"] = x["low"].min()
x["high_idxmax"] = x["high"].idxmax()
x["low_idxmin"] = x["low"].idxmin()
x["volume_sum"] = x["volume"].sum()
x["high_first"] = x["high_idxmax"] < x["low_idxmin"]
return x
But it is very slow. Is it possible to make it faster and avoid df.apply
?