When using pd.DataFrame.resample
a timerange is generated ranging from the smallest to the largest time entry present, the present values are aggregated on these bins. Bins without values get NaN
. In order to avoid this behaviour use pd.Grouper
.
This isue can be visualised using
df= pd.DataFrame([[1,25.3,'2023-01-16 00:08:19'],
[1,25.1,'2023-01-16 00:23:19'],
[1,25.0,'2023-01-16 02:08:19']],
columns = ['sensor_no','temp','readable_time'])
On an hourly basis the lowest entry is 2023-01-16 00:00:00
and the highest 2023-01-16 02:00:00
as such resample('H')
will create a range index with three entries
DatetimeIndex(['2023-01-16 00:00:00', '2023-01-16 01:00:00',
'2023-01-16 02:00:00'],
dtype='datetime64[ns]', name='readable_time', freq='H')
In our dataframe df
the first two entries will correspond to the first index and the third one to the third entry. As such there will be no entry to average over for the second index leaving behind a NaN
. Indeed
df.set_index('readable_time').groupby(['sensor_no'],as_index=False).resample('H').mean()
|
sensor_no |
temp |
2023-01-16 00:00:00 |
1 |
25.2 |
2023-01-16 01:00:00 |
nan |
nan |
2023-01-16 02:00:00 |
1 |
25 |
As mentioned in to solely aggregate on the rounded timevalue instead use a pd.Grouper
df.groupby([pd.Grouper(key='readable_time',freq='H'),'sensor_no']).mean()
This results in
|
|
temp |
2023-01-16 00:00:00 |
1 |
25.2 |
2023-01-16 02:00:00 |
1 |
25 |