Filling NaN using the mean values from the same hour of previous days

Question

I want to fill NaN values with the mean of the values at the same hour and minute on the previous days. Here is a simplified example of my df:

timstamp         data
22/04/2016 09:00 1
22/04/2016 09:05 2
...
23/04/2016 09:00 3
23/04/2016 09:05 4
...
24/04/2016 09:00 5
24/04/2016 09:05 6
...
25/04/2016 09:00 7
25/04/2016 09:05 8
...
25/04/2016 10:00 NaN
25/04/2016 10:05 NaN

The real data contains many days with continuous 5-minute intervals.

I tried df = df.groupby(df.index.minute).fillna(df.data.rolling(3).mean()) to take a rolling mean over the same hour-minute of the past days, but it did not work.

An alternative, df = df.groupby(df.index.minute).ffill(), takes the values from the previous two lines (i.e. 7 and 8), which come from the same minutes of the previous hour on the same day.
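A minimal sketch of why grouping by minute alone pulls values from the previous hour: 09:05 and 10:05 share the key 5, so they land in the same group. Grouping by the full time of day (df.index.time) keeps each hour-minute slot separate, so a forward fill reaches back to the previous day instead. The timestamps and the value 10.0 below are made up for illustration and are not from the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-series: two days, two slots per day (09:05 and 10:05)
idx = pd.to_datetime([
    "2016-04-24 09:05", "2016-04-24 10:05",
    "2016-04-25 09:05", "2016-04-25 10:05",
])
s = pd.Series([6.0, 10.0, 8.0, np.nan], index=idx)

# Key is the minute only: all four rows share key 5, so the NaN at
# 25/04 10:05 is filled from 25/04 09:05 (previous hour, same day)
by_minute = s.groupby(s.index.minute).ffill()

# Key is the full time of day: 10:05 rows form their own group, so the
# NaN is filled from 24/04 10:05 (same slot, previous day)
by_time = s.groupby(s.index.time).ffill()

print(by_minute.iloc[-1], by_time.iloc[-1])
```

The same change of grouping key applies to any per-slot operation, not only ffill.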

However, I want the following result:

timstamp         data
22/04/2016 09:00 1
22/04/2016 09:05 2
...
23/04/2016 09:00 3
23/04/2016 09:05 4
...
24/04/2016 09:00 5
24/04/2016 09:05 6
...
25/04/2016 09:00 7
25/04/2016 09:05 8
25/04/2016 10:00 3
25/04/2016 10:05 4

where the value 3 (second-to-last line) is the mean of the values from the same hour-minute of the previous days (mean of 1, 3, and 5), and 4 (last line) is the mean of 2, 4, and 6. Given the size of my df, I want to take the mean over tens of previous days.

Edit
I am getting closer. With the following code, the mean of the data is computed per hour and minute, as I wanted:

df.set_index('timstamp', inplace=True)
df=df.groupby([df.index.hour, df.index.minute]).mean()
df.index.names = ["hour", "minute"]

However, it uses the entire data set to compute the hour-minute mean. What I want is to use the same hour-minute from the previous days only, with the number of past days configurable. The resulting mean should then be used to fill the NaN.
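One possible way to restrict the mean to the previous days is sketched below. It groups by time of day, shifts each group by one row so the current day is excluded, and takes a rolling mean over the last n_days rows. The helper name fill_from_previous_days and the parameter n_days are made up for this illustration, and the sketch assumes a DatetimeIndex with exactly one row per slot per day (a complete 5-minute grid), so that consecutive rows within a group are one day apart. The toy data only mimics the example above: three days of values followed by a day of NaN.

```python
import numpy as np
import pandas as pd


def fill_from_previous_days(series, n_days=3):
    """Fill NaN with the mean of the same time of day over the previous n_days.

    Assumption: one row per time slot per day with no missing days, so the
    previous row in each time-of-day group is the previous day.
    """
    same_slot = series.groupby(series.index.time)
    # shift(1) drops the current day; rolling(...).mean() averages up to
    # n_days earlier values and skips NaN among them (min_periods=1)
    prev_mean = same_slot.transform(
        lambda s: s.shift(1).rolling(n_days, min_periods=1).mean()
    )
    return series.fillna(prev_mean)


# Toy data: 22-24 April have values, 25 April is missing
idx = pd.to_datetime(
    ["22/04/2016 09:00", "22/04/2016 09:05",
     "23/04/2016 09:00", "23/04/2016 09:05",
     "24/04/2016 09:00", "24/04/2016 09:05",
     "25/04/2016 09:00", "25/04/2016 09:05"],
    format="%d/%m/%Y %H:%M",
)
data = pd.Series([1, 2, 3, 4, 5, 6, np.nan, np.nan], index=idx, dtype=float)

filled = fill_from_previous_days(data, n_days=3)
print(filled)
```

Note that the means are computed from the original values, so a value filled on one day does not feed into the mean for the next day. If days can be missing from the index, reindexing to the full 5-minute grid first would be needed for the one-row-per-day assumption to hold.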

Quang Hoang · Answer 1 · 2019-04-12T13:03:31.797

Let's try this:

import numpy as np
import pandas as pd

# time sample every 5 mins
idx = pd.date_range('2018-01-01', '2018-01-31', freq='300s')
np.random.seed(2019)

# create toy data
df = pd.DataFrame({'idx': idx,
                   'data': np.random.uniform(0, 5, len(idx))})
df.loc[np.random.uniform(0, 1, len(idx)) > 0.95, 'data'] = None

# means by the hour, can also use median
means = df.resample('h', on='idx').data.mean()

# get the timestamp on the hour
df['hour'] = df['idx'] - pd.to_timedelta(df.idx.dt.minute, unit='m')

# get the hour stamp of previous day
df['hour'] -= pd.Timedelta(days=1)

# update NaN
# df.loc[df.data.isna(), 'data'] = means[nan_hour]

# the original mapping raised a ValueError due to duplicates in nan_hour;
# Series.map looks up each row's hour stamp in means and tolerates duplicates
# (rows on the first day have no previous day, so they stay NaN)
nan_rows = df.data.isna()
df.loc[nan_rows, 'data'] = df.loc[nan_rows, 'hour'].map(means)

here, the last line returns ValueError: cannot reindex from a duplicate axis — k.ko3n, Apr 12 '19 at 08:47
