1

I am downsampling data from 15 minutes scale to hourly scale with pandas resample.mean(). The result has NaN values.

Data C_hourly

Data = Data.set_index('readable_time')
C_hourly = Data.groupby(['sensor_no'],as_index=False).resample('H').mean()

I wanted to have the mean of 15 minute temperature records and its corresponding sensor numbers. I am wondering if it works when I split the csv file into n files for n sensors.

Scholar7
  • 11
  • 1

1 Answers1

1

When using pd.DataFrame.resample a timerange is generated ranging from the smallest to the largest time entry present, the present values are aggregated on these bins. Bins without values get NaN. In order to avoid this behaviour use pd.Grouper.

This isue can be visualised using

df= pd.DataFrame([[1,25.3,'2023-01-16 00:08:19'],
                  [1,25.1,'2023-01-16 00:23:19'],
                  [1,25.0,'2023-01-16 02:08:19']],
                 columns = ['sensor_no','temp','readable_time'])

On an hourly basis the lowest entry is 2023-01-16 00:00:00 and the highest 2023-01-16 02:00:00 as such resample('H') will create a range index with three entries

DatetimeIndex(['2023-01-16 00:00:00', '2023-01-16 01:00:00',
               '2023-01-16 02:00:00'],
              dtype='datetime64[ns]', name='readable_time', freq='H')

In our dataframe df the first two entries will correspond to the first index and the third one to the third entry. As such there will be no entry to average over for the second index leaving behind a NaN. Indeed

df.set_index('readable_time').groupby(['sensor_no'],as_index=False).resample('H').mean()
sensor_no temp
2023-01-16 00:00:00 1 25.2
2023-01-16 01:00:00 nan nan
2023-01-16 02:00:00 1 25

As mentioned in to solely aggregate on the rounded timevalue instead use a pd.Grouper

df.groupby([pd.Grouper(key='readable_time',freq='H'),'sensor_no']).mean()

This results in

temp
2023-01-16 00:00:00 1 25.2
2023-01-16 02:00:00 1 25
Arnau
  • 741
  • 1
  • 4
  • 8