1

Within Pandas, I would like to resample my dataframe and take the mean within a 5 hour period and within index level. My dataframe looks like: df

            timestamp       width  length
name                                    
10    2019-08-01 00:00:00   10.1    86.1
10    2019-08-01 00:00:10   10.0    86.2
10    2019-08-01 00:05:40   10.1    86.3
10    2019-08-01 00:05:50   10.0    86.2
8     2019-08-01 00:05:54   12.0   110.0

I would like to keep my 'name' variable as index (preferably not setting timestamp as index), like:

            timestamp       width  length
name                                    
10    2019-08-01 00:00:05   10.05   86.15
10    2019-08-01 00:05:45   10.05   86.25
8     2019-08-01 00:05:54   12.0    110.0

I tried:

df_resample = df.resample('5H', on='timestamp').mean()

But this will not perform within index level. Also it sets the datetime on the index which I try to avoid.

Jeroen
  • 801
  • 6
  • 20

2 Answers2

1

IIUC, you can use groupby and resample:

(df.groupby(level=0, sort=False)
   .resample('5min', on='timestamp').mean()
   .reset_index()
)

This however, does not average your timestamps, since you can't really add Datetime type in pandas, although there are ways around that.

   name           timestamp  width  length
0    10 2019-08-01 00:00:00  10.05   86.15
1    10 2019-08-01 00:05:00  10.05   86.25
2     8 2019-08-01 00:05:00  12.00  110.00

Update If you want mean timestamp, you can temporary convert timestamp to int, taking mean, and convert back:

(df.assign(int_time=lambda x: x['timestamp'].astype('int64') )
   .groupby(level=0, sort=False)
   .resample('5min', on='timestamp').mean()
   .reset_index()
   .assign(timestamp=lambda x: x['int_time'].astype('int64').astype('datetime64[ns]'))
   .drop('int_time', axis=1)
)

Output:

   name           timestamp  width  length
0    10 2019-08-01 00:00:05  10.05   86.15
1    10 2019-08-01 00:05:45  10.05   86.25
2     8 2019-08-01 00:05:54  12.00  110.00
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
  • Thank you for the reply. The solution is good enough, preferably I would get the average timetamps. – Jeroen Feb 26 '20 at 14:52
  • @Jeroen like I said, there are ways to go around that. See updated answer. – Quang Hoang Feb 26 '20 at 14:58
  • The solution looks good. However, when converting back: I get an error at the second last line: astype('int64') " ValueError: Cannot convert non-finite values (NA or inf) to integer". – Jeroen Feb 26 '20 at 15:19
  • That means you have `NaT` time in your `timestamp` You need to decide what to do with those. – Quang Hoang Feb 26 '20 at 15:21
  • I actually have 0 NaN in my timestamp. I checked this by:`count_nan = len(df_area['timestamplast']) - df_area['timestamplast'].count()` – Jeroen Feb 27 '20 at 16:41
  • How about `df_area['timestamplast'].isna().sum()`? – Quang Hoang Feb 27 '20 at 16:47
  • You could try chaining the first `.astype('int64')` with `.astype('float64')` so that `mean()` wouldn't overflow. If that doesn't work, them I'm sorry I'm out of idea. If you want, you can share your data and I can take a look. – Quang Hoang Feb 27 '20 at 16:55
  • thanks for your help. I got it working now by omitting: `df['timestamp'] = df['timestamp'].dt.strftime('%Y-%m-%d-%H-%M-%s').astype('datetime64[s]')` . The format was not needed and was automatically recognized. Much obliged! – Jeroen Feb 28 '20 at 16:09
0

Set timestamp as index temporarily, then reestablish the original index.

df = df.reset_index().set_index('timestamp').resample('5H').mean().set_index('name')

Is this what you desired to obtain?

Oleg O
  • 1,005
  • 6
  • 11
  • Thank you for your reply. This averages also my original index, which is sometime I don't want. By setting the index to name it removes the datetime, which is also unwanted. – Jeroen Feb 26 '20 at 14:51
  • So, don't average your index. You can make a resampler object like this `res_obj = df.reset_index().set_index('timestamp').resample('5H')` then make separate tables with separate functionality, then join, i.e. `df_mean = res_obj[some_columns].mean()` `df_last = res_obj[['name']].last()` `df = pd.concat([df_mean, df_last], axis=1).reset_index().set_index('name')` – Oleg O Feb 26 '20 at 15:12