
Imagine a data frame with multiple variables measured every 30 min. Every time series inside this data frame has gaps at possibly different positions. These gaps are to be replaced by some kind of running mean, let's say +/- 2 days. For example, if at day 4, 07:30 I have missing data, I want to replace the NaN entry with the average of the measurements at 07:30 on days 2, 3, 5 and 6. Note that it is also possible that, for example, day 5, 07:30 is also NaN -- in that case, it should be excluded from the average that replaces the missing measurement at day 4 (this should be possible with np.nanmean?)

I am not sure how to do this. Right now, I would probably loop over every single row and column in the data frame and write a really bad hack along the lines of np.mean(df["A"].iloc[[i-48, i, i+48]]), but I feel there must be a more pythonic/pandas-y way?

Sample data set:

import numpy as np
import pandas as pd

# generate a 1-week time series
dates = pd.date_range(start="2014-01-01 00:00", end="2014-01-07 00:00", freq="30min")
df = pd.DataFrame(np.random.randn(len(dates),3), index=dates, columns=("A", "B", "C"))

# generate some artificial gaps
df.loc["2014-01-04 10:00":"2014-01-04 11:00", "A"] = np.nan
df.loc["2014-01-04 12:30":"2014-01-04 14:00", "B"] = np.nan
df.loc["2014-01-04 09:30":"2014-01-04 15:00", "C"] = np.nan

print(df["2014-01-04 08:00":"2014-01-04 16:00"])

                            A         B         C
2014-01-04 08:00:00  0.675720  2.186484 -0.033969
2014-01-04 08:30:00 -0.897217  1.332437 -2.618197
2014-01-04 09:00:00  0.299395  0.837023  1.346117
2014-01-04 09:30:00  0.223051  0.913047       NaN
2014-01-04 10:00:00       NaN  1.395480       NaN
2014-01-04 10:30:00       NaN -0.800921       NaN
2014-01-04 11:00:00       NaN -0.932760       NaN
2014-01-04 11:30:00  0.057219 -0.071280       NaN
2014-01-04 12:00:00  0.215810 -1.099531       NaN
2014-01-04 12:30:00 -0.532563       NaN       NaN
2014-01-04 13:00:00 -0.697872       NaN       NaN
2014-01-04 13:30:00 -0.028541       NaN       NaN
2014-01-04 14:00:00 -0.073426       NaN       NaN
2014-01-04 14:30:00 -1.187419  0.221636       NaN
2014-01-04 15:00:00  1.802449  0.144715       NaN
2014-01-04 15:30:00  0.446615  1.013915 -1.813272
2014-01-04 16:00:00 -0.410670  1.265309 -0.198607

[17 rows x 3 columns]

(An even more sophisticated tool would also exclude measurements from the averaging procedure that were themselves created by averaging, but that doesn't necessarily have to be part of an answer, since I believe it may make things too complicated for now.)

/edit: A sample solution that I'm not really happy with:

# specify the columns of df where gaps should be filled
cols = ["A", "B", "C"]
for col in cols:
    for idx, row in df.iterrows():
        if pd.isnull(df.loc[idx, col]):
            # replace with the mean of the same time of day on adjacent days
            i = df.index.get_loc(idx)
            df.loc[idx, col] = np.nanmean(df[col].iloc[[i - 48, i + 48]])

There are two things I don't like about this solution:

  1. If a single line is missing or duplicated anywhere, this fails. In the last line, I would like to subtract "one day" every time, no matter whether that is 47, 48 or 49 rows away. Also, it would be good if I could extend the range (e.g. -3 days to +3 days) without manually writing a list for the index.
  2. I would like to get rid of the loops, if that is possible.
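To illustrate the first point, one possible sketch (my own, not part of the original approach) uses shift(freq=...) together with reindex: shifting by calendar time instead of by row count means a missing row cannot throw off the one-day offset. The data here is regenerated from the sample setup above.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
dates = pd.date_range("2014-01-01 00:00", "2014-01-07 00:00", freq="30min")
df = pd.DataFrame(np.random.randn(len(dates), 3), index=dates,
                  columns=["A", "B", "C"])
df.loc["2014-01-04 10:00":"2014-01-04 11:00", "A"] = np.nan

# shift(freq=...) moves the index itself by calendar time, so the
# "one day back" lookup does not depend on rows being exactly 48 apart
prev_day = df.shift(1, freq="D").reindex(df.index)   # yesterday's values
next_day = df.shift(-1, freq="D").reindex(df.index)  # tomorrow's values

fill = np.nanmean([prev_day.values, next_day.values], axis=0)
filled = df.fillna(pd.DataFrame(fill, index=df.index, columns=df.columns))
```

Extending the range is then a matter of collecting more shifted frames (freq="2D", freq="3D", ...) into the list passed to np.nanmean.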
Fred S

  • Is it possible that you may have more than 5 days missing? So you wouldn't be able to do +/- 2 days either way? – Ffisegydd Sep 18 '14 at 09:36
  • In principle, yes, it can happen. But if that is the case, I have a whole different problem and the technique that I described isn't really applicable anymore. So for our problem, **no, we don't need to consider that case**. – Fred S Sep 18 '14 at 10:40
  • This doesn't do exactly what you asked for, but you should be aware of it: http://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.interp1d.html#scipy.interpolate.interp1d - SciPy's interp1d can do various forms of spline fitting and linear interpolation. You could use it on each time of day separately I suppose. – John Zwinck Sep 18 '14 at 12:46
  • John-- thank you. I am aware of that function (and Pandas also has a built-in interpolate method); sadly, this is not really what I am looking for / what I described is an established method when it comes to the data I have. – Fred S Sep 18 '14 at 12:56
  • What happens at the edges where you cannot go -2, or +2 days? – Woody Pride Sep 18 '14 at 16:10
  • Pragmatic answer: For now, I make sure that it doesn't happen. I'm sure I will have to address this in the future, but I don't really know how to solve that problem yet. Obvious answers would be to decrease the window size at the edges, or to "mirror" the time series. – Fred S Sep 18 '14 at 20:00

1 Answer


This should be a faster and more concise way to do it. The main thing is to use the shift() function instead of the loop. A simple version would be this:

df[df.isnull()] = np.nanmean([df.shift(-48).values, df.shift(48).values], axis=0)

It turned out to be really hard to generalize this, but this seems to work:

df[df.isnull()] = np.nanmean(
    [df.shift(x).values for x in range(-48 * window, 48 * (window + 1), 48)],
    axis=0)

I'm not sure, but I suspect there might be a bug in nanmean, and it's probably also the reason you got missing values yourself: it seems that nanmean cannot handle NaNs if you feed it a DataFrame. But if I convert to an array (with .values) and use axis=0, then it works.

Check results for window=1:

print(df.loc["2014-01-04 12:30":"2014-01-04 14:00", "B"])
print(df.loc["2014-01-03 12:30":"2014-01-03 14:00", "B"])
print(df.loc["2014-01-05 12:30":"2014-01-05 14:00", "B"])

2014-01-04 12:30:00    0.940193     # was nan, now filled
2014-01-04 13:00:00    0.078160
2014-01-04 13:30:00   -0.662918
2014-01-04 14:00:00   -0.967121

2014-01-03 12:30:00    0.947915     # day before
2014-01-03 13:00:00    0.167218
2014-01-03 13:30:00   -0.391444
2014-01-03 14:00:00   -1.157040

2014-01-05 12:30:00    0.932471     # day after
2014-01-05 13:00:00   -0.010899
2014-01-05 13:30:00   -0.934391
2014-01-05 14:00:00   -0.777203
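For reference, the generalized fill as a self-contained snippet (window defined explicitly; the data is regenerated with a fixed seed, so the numbers differ from the output above). The x == 0 term contributes the row's own value, which is NaN exactly where we fill, so nanmean simply ignores it:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
dates = pd.date_range("2014-01-01 00:00", "2014-01-07 00:00", freq="30min")
df = pd.DataFrame(np.random.randn(len(dates), 3), index=dates,
                  columns=["A", "B", "C"])
df.loc["2014-01-04 12:30":"2014-01-04 14:00", "B"] = np.nan

window = 1  # average over +/- 1 day; 48 half-hour rows per day
df[df.isnull()] = np.nanmean(
    [df.shift(x).values for x in range(-48 * window, 48 * (window + 1), 48)],
    axis=0)
```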

Regarding problem #2, it will depend on your data, but if you precede the above with

df = df.resample('30min').asfreq()

that will give you a row of NaNs for each missing row, and then you can fill them in the same way as all the other NaNs. That's probably the simplest and fastest way, if it works.
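A minimal illustration of what that resample step does (asfreq inserts all-NaN rows at missing timestamps rather than aggregating anything):

```python
import numpy as np
import pandas as pd

# a series with one 30-min row missing entirely (01:00 is absent)
idx = pd.to_datetime(["2014-01-01 00:00", "2014-01-01 00:30",
                      "2014-01-01 01:30"])
s = pd.Series([1.0, 2.0, 4.0], index=idx)

# snap to a regular 30-min grid; the missing slot becomes NaN,
# after which it can be filled like any other gap
regular = s.resample("30min").asfreq()
```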

Alternatively, you could do something with groupby. My groupby-fu is weak but to give you the flavor of it, something like:

df.groupby(df.index.hour).ffill()

would correctly deal with the issue of missing rows, but not the other things.
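To flesh out that groupby flavor a little (a sketch of my own: it groups on the full clock time and averages over all days rather than a +/- 2 day window, so it only approximates what was asked for):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2014-01-01", periods=48 * 5, freq="30min")
df = pd.DataFrame({"A": np.arange(48 * 5, dtype=float)}, index=dates)
df.iloc[96, 0] = np.nan  # 2014-01-03 00:00 goes missing

# group rows by clock time (hour, minute) and replace each group's
# NaNs with that group's mean across all days
filled = df.groupby([df.index.hour, df.index.minute]).transform(
    lambda g: g.fillna(g.mean()))
```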

JohnE
  • Thanks, I will give that a try as soon as I'm back in the office tomorrow. This does seem to solve problem no. 2 with my sample solution, but not no. 1. I played around with `datetime.timedelta` in order to subtract / add one day to my date index, but I couldn't really get it to work (may post examples tomorrow). Any idea about that? Chances are good I will soon get to work with data that may have missing rows (i.e. two days are not necessarily 48 rows apart). – Fred S Sep 18 '14 at 20:04
  • 1
    Sorry, I'm not sure about that. I misread and thought the data was more regular. Perhaps resample() and then apply the above method? If your data is consistently spaced except for some missing rows, then resampling ought to work fine and be easy to do. I'm sure you could do something with groupby but that could be a lot slower. Maybe someone else will have a better idea though. – JohnE Sep 18 '14 at 20:13
  • Dear JohnE-- Thank you. Good idea using resample, that should indeed work. It's not exactly what I had in mind, but it should do the job. I will implement it in a few hours and mark your answer as accepted as soon as I get it to work. Sorry if I was a bit unclear in my question. I may open another question on shifting datetime indices by a certain amount of time instead of rows. – Fred S Sep 19 '14 at 06:34