How to resample a column based on the maximum value of another column?

Question

Say, I have a surface wind dataset with an irregular temporal step, looking like below. The actual dataset has many other columns and thousands of rows.

Time, Speed, Direction
2023-1-1 01:00:00, 6, 90
2023-1-1 02:00:00, 6, 70
2023-1-1 03:00:00, 9, 70
2023-1-1 04:00:00, 6, 230
2023-1-1 06:00:00, 2, 320
2023-1-1 08:00:00, 2, 100
2023-1-1 11:00:00, 3, 140
2023-1-1 15:00:00, 12, 10
2023-1-1 16:00:00, 13, 20
2023-1-1 17:00:00, 15, 60
2023-1-1 18:00:00, 10, 80

I'd like to resample the data into a regular time step (00 03 06 09 12 15 18 21) by calculating the maximum Speed, and the Direction corresponding to the max speed. How can this be done? I am figuring out something like this, but it does not work.

df3h = df.resample('3H').agg({ # 3H Does not work if the time series donot start at 00:00
'Speed':'max'
'Direction': lambda x, x.loc[x.Speed.idxmax(),'Direction'] # This Won't Work!
})

in case of twice the max speed in your 3h step (like in the first 2 rows), what value of direction you want? also in your real data it is an hour data or you can have a point at `2023-1-1 15:12:35` for example — Ben.T, Jan 06 '23 at 14:27
@Ben.T This is a simplified example. The real dataset likely has small variations in speed over time. I'd like to enforce the same resampling rule for entire dataset - max speed for each chunk and the direction of that max. speed. — peteron30, Jan 06 '23 at 14:30

score 4 · Answer 1 · answered Jan 06 '23 at 14:59

You can do it using groupby and idxmax before creating a new dataframe. First use floor on the Time column to get the groups of 3h.

# create 3h bins
_3h = df['Time'].dt.floor('3H')
print(_3h) # if Time is index, then do df.index.floor('3H')
# 0    2023-01-01 00:00:00
# 1    2023-01-01 00:00:00
# 2    2023-01-01 03:00:00
# 3    2023-01-01 03:00:00
# 4    2023-01-01 06:00:00
# ...

#use it to groupby and get the index of the max Speed with idxmax
_idxmax = df.groupby(_3h)['Speed'].idxmax()
print(_idxmax)
# Time
# 2023-01-01 00:00:00     0
# 2023-01-01 03:00:00     2
# 2023-01-01 06:00:00     4
# ...

# create the result dataframe
new_df = (
    df.loc[_idxmax, ['Speed','Direction']]
      .set_index(_idxmax.index)
      # in case a 3h bin is missing like 12:00 in your example
      .reindex(pd.Index(
          pd.date_range(_3h.min(), _3h.max(), freq='3h'), 
          name='Time'))
      # if you want Time as column
      .reset_index()
)
print(new_df)
#                  Time  Speed  Direction
# 0 2023-01-01 00:00:00    6.0       90.0
# 1 2023-01-01 03:00:00    9.0       70.0
# 2 2023-01-01 06:00:00    2.0      320.0
# 3 2023-01-01 09:00:00    3.0      140.0
# 4 2023-01-01 12:00:00    NaN        NaN
# 5 2023-01-01 15:00:00   15.0       60.0
# 6 2023-01-01 18:00:00   10.0       80.0

This solution works if there are no duplicate indices in the datetimeindex. But the actual data sometimes contain measurements at the same time instances, from different networks. — peteron30, Jan 06 '23 at 17:54
@peteron30 IIUC about the duplicates Time, an easy fix is to do `df = df.reset_index()` before doing anything so no more duplicates index and Time becomes a column instead of index (and the solution given here is design with Time as a column originally) — Ben.T, Jan 06 '23 at 20:21

How to resample a column based on the maximum value of another column?

1 Answers1