I have a data frame with one column of float values and a timestamp index, similar to the one below. The timestamps in the index are sorted in ascending order.

timestamp           value
2014-01-08 08:00:42 1
2014-01-08 08:01:00 NaN
2014-01-08 08:01:12 3
2014-01-08 08:01:52 7
2014-01-08 08:02:02 9

How do I linearly interpolate (impute) the NaN value based only on the values 1 and 3 (i.e., the values immediately above and below it)? The data frame is huge and can run up to 2 GB, so speed is important for this operation.

Thanks

RTM
  • Have you seen https://stackoverflow.com/questions/25234782/replace-nan-or-missing-values-with-rolling-mean-or-other-interpolation? – Zero Sep 19 '17 at 15:57
  • I didn't know about the rolling functions in pandas, so I couldn't search for this option. Thanks for pointing me to this link; it looks very relevant to my question. – RTM Sep 19 '17 at 16:06
  • I just want to add to the question that the timestamps are not at equal intervals; they vary a lot. So the interpolation should also be based on the timestamps and how far apart they are. – RTM Sep 19 '17 at 16:37

1 Answer

Edit: you updated your question to interpolate based on the timestamp differences in the index, applying those proportions to your DataFrame values.

Here's a crude way of going about it that should actually be decently quick:

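import numpy as np
import pandas as pd

ts = pd.to_datetime(['2014-01-08 08:00:42',
                     '2014-01-08 08:01:00',
                     '2014-01-08 08:01:12',
                     '2014-01-08 08:01:52',
                     '2014-01-08 08:02:02'])

df = pd.DataFrame([1, np.nan, 3, 7, 9], index=ts, columns=['value'])

# Seconds elapsed since the previous timestamp (NaN for the first row)
td = df.index.to_series().diff().dt.total_seconds()

# Previous value plus the neighbor-to-neighbor change, scaled by the
# fraction of the surrounding time gap that lies before the current row
prev, nxt = df['value'].shift(1), df['value'].shift(-1)
interp = prev + (nxt - prev) * td / (td.shift(-1) + td)

# Only the NaN rows take the interpolated value
df['value'] = df['value'].fillna(interp)
print(df)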
                     value
2014-01-08 08:00:42    1.0
2014-01-08 08:01:00    2.2
2014-01-08 08:01:12    3.0
2014-01-08 08:01:52    7.0
2014-01-08 08:02:02    9.0

In this example, the NaN at the second position is 18 seconds after the previous timestamp and 12 seconds before the next one. It should therefore be filled with the value 1 + (3 - 1) * (18 / (18 + 12)) = 2.2.
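
Written out in general form, the interp expression in the code above amounts to the following for a lone NaN at row i. The symbols v (value) and t (timestamp) are notation introduced here for illustration, and the formula only applies when both neighboring values are valid:

```latex
\hat{v}_i = v_{i-1} + \left(v_{i+1} - v_{i-1}\right) \cdot \frac{t_i - t_{i-1}}{t_{i+1} - t_{i-1}}
```

With the sample data this gives 1 + (3 - 1) * (18 / 30) = 2.2.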

You may be able to simplify that math as well.
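
One possible simplification, offered as a sketch and not as part of the approach above, is to let pandas do the time-weighted interpolation itself with Series.interpolate(method='time'). The limit_area='inside' argument is an extra assumption on my part (it requires a reasonably recent pandas) so that leading or trailing NaNs are left alone instead of being filled:

```python
import numpy as np
import pandas as pd

ts = pd.to_datetime(['2014-01-08 08:00:42',
                     '2014-01-08 08:01:00',
                     '2014-01-08 08:01:12',
                     '2014-01-08 08:01:52',
                     '2014-01-08 08:02:02'])

df = pd.DataFrame({'value': [1, np.nan, 3, 7, 9]}, index=ts)

# method='time' weights by the real gaps in the DatetimeIndex rather than
# treating rows as equally spaced; limit_area='inside' fills only NaNs
# that have a valid value on both sides.
filled = df['value'].interpolate(method='time', limit_area='inside')
print(filled)
```

For the sample data this should fill the second row with roughly 2.2, matching the manual calculation. One difference in behavior: for a run of consecutive NaNs, interpolate works from the nearest valid values on either side, whereas the shift-based version leaves such rows as NaN because an immediate neighbor is missing. Whether it is faster on a 2 GB frame is something you would need to time yourself.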

One other thing: while working on this answer, I asked a question of my own about calculating the timedeltas. @piRSquared has a faster solution here, if speed matters that much to you.

Brad Solomon
  • Thanks. I made a few changes to the impute function you specified earlier so that it refers to the index (which is time) and interpolates based on the distance between timestamps. It worked really well and the code looks clean :) – RTM Sep 25 '17 at 16:16