I have a large DataFrame with two grouping columns (g1 and g2) plus a score and a day column. Is there a simple way with pandas tools to fill the gaps in the days, and to fill the missing scores with the average (or alternatively an EWMA, etc.) of the preceding values?

First I overwrite the scores group by group, and later I stack the modified grouped DataFrames back together.

 dfg = df.groupby(['g1','g2'])
 for name, group in dfg:
     print(group)
     break
 ix               g1           g2   score      day
 4                19           24    4.150513  2014-02-12
 5                19           24    6.986235  2014-02-13
 6                19           24    9.634231  2014-02-14
 7                19           24    1.818548  2014-02-15
 8                19           24    1.699897  2014-03-02
 9                19           24    2.128781  2014-03-25
 10               19           24    1.720297  2014-03-26
 14               19           24    2.079877  2014-03-30
George Netu
Christian
  • May I ask why you want to do this? If the whole point is an econometric regression, you are not adding any more information to your data. Any method-of-moments estimator should handle those gaps better than your manual interpolation ever will. – FooBar May 07 '14 at 08:45
  • Yes, sometimes it's business as usual and the analytical meaningfulness becomes secondary :-( – Christian May 07 '14 at 09:09

1 Answer

I've never done this, but looking at the manual gave me the following idea as a starting point:

import pandas as pd

# scores are numeric so that they can be averaged or interpolated later
df = pd.DataFrame([['2011-01-01', 1], ['2011-01-03', 2]], columns=['day', 'score']).set_index('day')
# originally df.index.to_datetime(); that Index method was removed from newer pandas, pd.to_datetime does the same job
df.index = pd.to_datetime(df.index)
rng = pd.date_range('1/1/2011', periods=12, freq='D')
df2 = pd.DataFrame(index=rng)

# now, for those that we actually have data, put it in:
df2['score'] = df['score']

The final result then:

            score
2011-01-01    1.0
2011-01-02    NaN
2011-01-03    2.0
2011-01-04    NaN
2011-01-05    NaN
2011-01-06    NaN
2011-01-07    NaN
2011-01-08    NaN
2011-01-09    NaN
2011-01-10    NaN
2011-01-11    NaN
2011-01-12    NaN

Now, you can apply interpolation methods on the NaN values as described in the docs.
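
For the grouped data in the question, the same idea could be sketched roughly as follows. This is only a sketch under a few assumptions: the day column is already a datetime, the gaps should be filled on a daily calendar between each group's first and last day, and "average of the values before" means the running mean of all observed scores so far. The helper name fill_gaps_with_running_mean and the sample values are made up for illustration.

```python
import pandas as pd


def fill_gaps_with_running_mean(group, date_col="day", value_col="score"):
    """Reindex one group to daily frequency and fill gaps with the mean of earlier observed scores."""
    s = group.set_index(date_col)[value_col].sort_index()
    # daily calendar from the group's first to its last observed day (an assumption)
    full_range = pd.date_range(s.index.min(), s.index.max(), freq="D")
    s = s.reindex(full_range)
    # expanding mean skips NaN, so at a gap it equals the mean of all scores seen before it
    running_mean = s.expanding(min_periods=1).mean()
    # a possible alternative for an exponentially weighted fill: s.ewm(span=3).mean()
    return s.fillna(running_mean)


# hypothetical sample data with the same columns as in the question
df = pd.DataFrame({
    "g1": [19, 19, 19],
    "g2": [24, 24, 24],
    "score": [1.0, 2.0, 6.0],
    "day": pd.to_datetime(["2014-02-12", "2014-02-14", "2014-02-17"]),
})

pieces = []
for (g1, g2), group in df.groupby(["g1", "g2"]):
    filled = fill_gaps_with_running_mean(group).rename("score").rename_axis("day").reset_index()
    filled["g1"], filled["g2"] = g1, g2
    pieces.append(filled)

# stack the filled groups back together
result = pd.concat(pieces, ignore_index=True)
print(result)
```

Observed scores are left untouched; only the newly created days get a filled value. An explicit loop over the groups is used here instead of groupby(...).apply, mostly because it mirrors the "modify each group, then stack" approach from the question.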

FooBar
  • Thanks for the starting point. I got an error for df.index = df.index.to_datetime: TypeError: 'instancemethod' object is not iterable. Version: 0.13.0 – Christian May 07 '14 at 09:53
  • That's because the brackets magically went missing. Fixed the line :) – FooBar May 07 '14 at 11:52