0

I have a data frame with the following columns: {'day','measurement'}

And there might be several measurements in a day (or no measurements at all)

For example:

day     |    measurement
1       |     20.1
1       |     20.9
3       |     19.2
4       |     20.0
4       |     20.2

and an array of coefficients: coef={-1:0.2, 0:0.6, 1:0.2}

My goal is to resample the data and average it using the coefficiets, (missing data should be left out).

This is the code I wrote to calculate that

window=[-1,0,-1]
df['resampled_measurement'][df['day']==d]=[coef[i]*df['measurement'][df['day']==d-i].mean() for i in window if df['measurement'][df['day']==d-i].shape[0]>0].sum()
df['resampled_measurement'][df['day']==d]/=[coef[i] for i in window if df['measurement'][df['day']==d-i].shape[0]>0].sum()

For the example above, the output should be:

Day  measurement
1    20.500
2    19.850
3    19.425
4    19.875

The problem is that the code runs forever, and I'm pretty sure that there's a better way to resample with coefficients.

Any advice would be highly appreciated !

Zero
  • 74,117
  • 18
  • 147
  • 154
Uri Goren
  • 13,386
  • 6
  • 58
  • 110
  • Could you please help me understand how the correlations translate to the expected output above? My understanding was that on day 4, for example, you would want `(0.2 * 19.2 + 0.6 * 20.1) / 0.8` which is `19.875`, not `19.97`. If you could walk through just day 4 or the day 3 calculation, that'd help. – S Anand Apr 20 '15 at 14:08
  • @UriGoren are the Day 2, 3 measurements accurate as expected? I think, you should update these too! – Zero Apr 20 '15 at 14:15

1 Answers1

2

Here's a possible solution to what you're looking for:

        # This is your data
In [2]: data = pd.DataFrame({
   ...:     'day': [1, 1, 3, 4, 4],
   ...:     'measurement': [20.1, 20.9, 19.2, 20.0, 20.2]
   ...: })

        # Pre-compute every day's average, filling the gaps
In [3]: measurement = data.groupby('day')['measurement'].mean()

In [4]: measurement = measurement.reindex(pd.np.arange(data.day.min(), data.day.max() + 1))

In [5]: coef = pd.Series({-1: 0.2, 0: 0.6, 1: 0.2})

        # Create a matrix with the time-shifted measurements
In [6]: matrix = pd.DataFrame({key: measurement.shift(key) for key, val in coef.iteritems()})

In [7]: matrix
Out[7]:
       -1     0     1
day
1     NaN  20.5   NaN
2    19.2   NaN  20.5
3    20.1  19.2   NaN
4     NaN  20.1  19.2

        # Take a weighted average of the matrix
In [8]: (matrix * coef).sum(axis=1) / (matrix.notnull() * coef).sum(axis=1)
Out[8]:
day
1    20.500
2    19.850
3    19.425
4    19.875
dtype: float64
S Anand
  • 11,364
  • 2
  • 28
  • 23