Imagine a data frame with multiple variables measured every 30 min. Every time series in this data frame has gaps, possibly at different positions. These gaps are to be replaced by some kind of running mean, let's say over +/- 2 days. For example, if data is missing at day 4, 07:30, I want to replace the NaN entry with the average of the measurements at 07:30 on days 2, 3, 5 and 6. Note that, for example, day 5, 07:30 may also be NaN; in that case, it should be excluded from the average that replaces the missing measurement at day 4 (should be possible with np.nanmean?).
I am not sure how to do this. Right now, I would probably loop over every single row and column in the data frame and write a really bad hack along the lines of np.mean(df["A"].iloc[[i-48, i, i+48]]), but I feel there must be a more pythonic/pandas-y way?
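To make the intended computation concrete for a single timestamp, here is a minimal sketch. The helper name `same_time_neighbour_mean` and the demo series are made up for illustration, and the sketch assumes a unique DatetimeIndex. It looks up the same clock time on the surrounding days by timestamp rather than by row position and lets np.nanmean skip the missing neighbours.

```python
import numpy as np
import pandas as pd

def same_time_neighbour_mean(series, ts, n_days=2):
    """Mean of `series` at the same clock time on the +/- n_days around `ts`.

    Hypothetical helper for illustration; assumes a unique DatetimeIndex.
    """
    ts = pd.Timestamp(ts)
    # same time of day on the neighbouring days (ts itself is excluded)
    neighbours = [ts + pd.Timedelta(days=k)
                  for k in range(-n_days, n_days + 1) if k != 0]
    # reindex yields NaN for timestamps that are not present in the index
    values = series.reindex(neighbours).to_numpy()
    # nanmean ignores NaN entries (all-NaN input gives NaN plus a warning)
    return np.nanmean(values)

# demo data: the value of each sample is simply its day of the month
dates = pd.date_range("2014-01-01 00:00", "2014-01-07 00:00", freq="30min")
demo = pd.Series(np.asarray(dates.day, dtype=float), index=dates)
demo.loc[pd.Timestamp("2014-01-04 07:30")] = np.nan
demo.loc[pd.Timestamp("2014-01-05 07:30")] = np.nan

# averages days 2, 3 and 6 only, because day 5 is itself NaN
print(same_time_neighbour_mean(demo, "2014-01-04 07:30"))
```

This still handles one timestamp at a time, so it only illustrates the arithmetic; it does not remove the loops.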
Sample data set:
import numpy as np
import pandas as pd
# generate a 1-week time series
dates = pd.date_range(start="2014-01-01 00:00", end="2014-01-07 00:00", freq="30min")
df = pd.DataFrame(np.random.randn(len(dates),3), index=dates, columns=("A", "B", "C"))
# generate some artificial gaps
df.loc["2014-01-04 10:00":"2014-01-04 11:00", "A"] = np.nan
df.loc["2014-01-04 12:30":"2014-01-04 14:00", "B"] = np.nan
df.loc["2014-01-04 09:30":"2014-01-04 15:00", "C"] = np.nan
print(df.loc["2014-01-04 08:00":"2014-01-04 16:00"])
A B C
2014-01-04 08:00:00 0.675720 2.186484 -0.033969
2014-01-04 08:30:00 -0.897217 1.332437 -2.618197
2014-01-04 09:00:00 0.299395 0.837023 1.346117
2014-01-04 09:30:00 0.223051 0.913047 NaN
2014-01-04 10:00:00 NaN 1.395480 NaN
2014-01-04 10:30:00 NaN -0.800921 NaN
2014-01-04 11:00:00 NaN -0.932760 NaN
2014-01-04 11:30:00 0.057219 -0.071280 NaN
2014-01-04 12:00:00 0.215810 -1.099531 NaN
2014-01-04 12:30:00 -0.532563 NaN NaN
2014-01-04 13:00:00 -0.697872 NaN NaN
2014-01-04 13:30:00 -0.028541 NaN NaN
2014-01-04 14:00:00 -0.073426 NaN NaN
2014-01-04 14:30:00 -1.187419 0.221636 NaN
2014-01-04 15:00:00 1.802449 0.144715 NaN
2014-01-04 15:30:00 0.446615 1.013915 -1.813272
2014-01-04 16:00:00 -0.410670 1.265309 -0.198607
[17 rows x 3 columns]
(An even more sophisticated tool would also exclude from the averaging procedure any measurements that were themselves created by averaging, but that doesn't necessarily have to be included in an answer, since I believe this may make things too complicated for now.)
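One possible way to keep track of which values are original measurements is to record a boolean mask before filling anything. The snippet below is only a sketch on a toy series; the names `was_missing` and `originals_only` are made up, and the fill with the overall mean merely stands in for whatever gap-filling step is used.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2014-01-01", periods=6, freq="D")
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0, 6.0], index=idx)

# remember which entries were gaps before any filling happens
was_missing = s.isna()

# placeholder fill (overall mean), standing in for any gap-filling step
filled = s.fillna(s.mean())

# view of the filled series in which the filled-in values are NaN again,
# so a later average over it only sees original measurements
originals_only = filled.where(~was_missing)
print(originals_only)
```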
/edit: A sample solution that I'm not really happy with:
# specify the columns of df where gaps should be filled
cols = ["A", "B", "C"]
for col in cols:
    for i in range(len(df)):
        if np.isnan(df[col].iloc[i]):
            # replace with mean of adjacent days (48 rows = 1 day at 30 min)
            df.loc[df.index[i], col] = np.nanmean(df[col].iloc[[i - 48, i + 48]])
There are two things I don't like about this solution:
- If there is a single line missing or duplicated anywhere, this fails. In the last line, I would like to subtract "one day" every time, no matter whether that is 47, 48 or 49 rows away. Also, it would be good if I could extend the range (e.g. -3 days to +3 days) without manually writing a list for the index.
- I would like to get rid of the loops, if that is possible.
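One direction that might address both points is to shift the whole frame by whole days in time (not in rows) and average the shifted copies. The following is a sketch, not a tested solution for real data: the function name `fill_gaps_same_time_of_day` and the parameter `n_days` are made up, and it assumes a DatetimeIndex without duplicated timestamps (missing rows are fine). The only remaining loop runs over the day offsets, not over rows or columns.

```python
import numpy as np
import pandas as pd

def fill_gaps_same_time_of_day(df, n_days=2):
    """Fill NaNs with the mean of the same clock time on the +/- n_days around.

    Hypothetical helper; assumes a unique DatetimeIndex.
    """
    total = pd.DataFrame(0.0, index=df.index, columns=df.columns)
    count = pd.DataFrame(0, index=df.index, columns=df.columns)
    for k in range(-n_days, n_days + 1):
        if k == 0:
            continue
        # move every timestamp by k days, then align back to the original index;
        # timestamps with no counterpart k days away become NaN
        shifted = df.shift(freq=pd.Timedelta(days=k)).reindex(df.index)
        total = total + shifted.fillna(0.0)
        count = count + shifted.notna().astype(int)
    # mean over the available neighbours; NaN where no neighbour has data
    neighbour_mean = total / count.where(count > 0)
    # keep original measurements, use the neighbour mean only in the gaps
    return df.where(df.notna(), neighbour_mean)

# demo data: the value of each sample is simply its day of the month
dates = pd.date_range("2014-01-01 00:00", "2014-01-07 00:00", freq="30min")
demo = pd.DataFrame({"A": np.asarray(dates.day, dtype=float)}, index=dates)
demo.loc[pd.Timestamp("2014-01-04 07:30"), "A"] = np.nan
demo.loc[pd.Timestamp("2014-01-05 07:30"), "A"] = np.nan

filled = fill_gaps_same_time_of_day(demo, n_days=2)
print(filled.loc["2014-01-04 07:00":"2014-01-04 08:00"])
```

Because every average is computed from the unmodified input frame, values that were themselves filled in never feed into another average. Widening the window to +/- 3 days is a matter of passing `n_days=3`.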