
All - I hope you'll be able to help, as it's one of those tasks where I know I've almost cracked it from the various postings on here and online, but haven't quite got it to work.

Essentially, I have the following data in a database, returned to a pandas DataFrame via psql.read_sql(sql, cnxn):

+------------------------------------+
|              StartTime  StartLevel |
+------------------------------------+
| 0  2015-02-16 00:00:00     480.000 |
| 1  2015-02-16 00:30:00     480.000 |
| 2  2015-02-16 00:34:00     390.000 |
| 3  2015-02-16 01:00:00     390.000 |
| 4  2015-02-16 01:30:00     390.000 |
| 5  2015-02-16 02:00:00     480.000 |
| 6  2015-02-16 02:17:00     420.000 |
+------------------------------------+

StartTime     datetime64[ns]
StartLevel           float64
dtype: object

I simply want to end up with a minute-by-minute interpolation of the above data.

I've also created a datetime series at minute frequency, but for the life of me I can't work out how to "map" my table onto this and then interpolate, or how I could resample StartTime to minute granularity and then interpolate the missing data.
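From the postings I've read, it sounds like something along these lines ought to work - a sketch with the sample data hard-coded in place of the database read, so I may well be missing something:

```python
import pandas as pd

# Sample data standing in for the psql.read_sql(sql, cnxn) result
df = pd.DataFrame({
    "StartTime": pd.to_datetime([
        "2015-02-16 00:00:00", "2015-02-16 00:30:00",
        "2015-02-16 00:34:00", "2015-02-16 01:00:00",
    ]),
    "StartLevel": [480.0, 480.0, 390.0, 390.0],
})

# Make StartTime the index, resample to minute frequency,
# then linearly interpolate the gaps between readings
per_minute = df.set_index("StartTime").resample("min").interpolate()
```

With the readings above, the minute between 00:30 (480) and 00:34 (390) comes out as 457.5 at 00:31, i.e. a straight line between the two known points.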

Any assistance would be greatly appreciated (and I am certain I am going to kick myself when I find out the solution!) - Many thanks

UPDATE

Following the suggestions below, the code is as follows:

import datetime
import numpy as np
import pandas as pd
import pyodbc
import pandas.io.sql as psql


cnxn = pyodbc.connect('DSN=MySQL;DATABASE=db;UID=uid;PWD=pwd')
cursor = cnxn.cursor()
sql = """
    SELECT
    StartTime,StartLevel
FROM
    aa.bb
    where cc = 'dd'
    and StartTime < '2015-02-16 02:30:00'
    order by StartTime asc"""

old_df = psql.read_sql(sql, cnxn)


num_minutes = 120
base = datetime.datetime(2015, 2, 16, 0, 0, 0)
date_list = [base + datetime.timedelta(minutes=x) for x in range(0, num_minutes)]
# set num_minutes to whatever is the correct number of minutes you require
new_data = [dict(StartTime=d, fake_val=np.nan) for d in date_list]
new_df = pd.DataFrame(new_data)
new_df['StartLevel'] = old_df['StartLevel']
new_df.interpolate(inplace=True)

The output from new_df at the prompt is:

+-----------------------------------------------+
|              StartTime  fake_val  StartLevel  |
+-----------------------------------------------+
| 0   2015-02-16 00:00:00       NaN         480 |
| 1   2015-02-16 00:01:00       NaN         480 |
| 2   2015-02-16 00:02:00       NaN         390 |
| 3   2015-02-16 00:03:00       NaN         390 |
| 4   2015-02-16 00:04:00       NaN         390 |
| 5   2015-02-16 00:05:00       NaN         480 |
| 6   2015-02-16 00:06:00       NaN         480 |
+-----------------------------------------------+
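For context on why the values landed in the wrong rows here: the assignment new_df['StartLevel'] = old_df['StartLevel'] aligns on the two frames' default integer indexes (0, 1, 2, ...), not on the timestamps, so the seven original readings are pasted onto the first seven minutes. A minimal illustration of that alignment behaviour (made-up column names):

```python
import pandas as pd

# Two frames sharing a default RangeIndex but with different timestamps
old = pd.DataFrame({"t": pd.to_datetime(["2015-02-16 00:30:00"]),
                    "v": [480.0]})
new = pd.DataFrame({"t": pd.to_datetime(["2015-02-16 00:00:00",
                                         "2015-02-16 00:30:00"])})

# Alignment happens on the integer index, not on the 't' column:
# old's value for 00:30 lands on new's row 0, which is 00:00
new["v"] = old["v"]
```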
Patrick A

1 Answer


I'm quite certain this is not the most Pythonic answer, so I welcome comments to improve it, but I believe you can do something like this.

First create all the datetime objects you want values for

import datetime

num_minutes = 120
base = datetime.datetime(2015, 2, 16, 0, 0, 0)
date_list = [base + datetime.timedelta(minutes=x) for x in range(0, num_minutes)]
# set num_minutes to whatever is the correct number of minutes you require

Then create a "fake" dataframe with those index values

import numpy as np
import pandas as pd

new_data = [dict(StartTime=d, fake_val=np.nan) for d in date_list]
new_df = pd.DataFrame(new_data)

EDIT: Corrected response

Now we want to merge the two dataframes into one (and sort by StartTime):

final_df = new_df.merge(old_df, how='outer', on='StartTime').sort_values(by='StartTime')

final_df will now be sorted by StartTime and contain the right values for StartLevel where you had data, and NaN where you didn't. Then you can call interpolate.

EDIT: interpolate does not operate in place by default, so you either need to set that flag or save off the result:

final_df = final_df.interpolate()

or

final_df.interpolate(inplace=True)
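Putting the pieces together with the question's column names (and hard-coded sample data standing in for the database read), the whole pipeline looks roughly like this:

```python
import numpy as np
import pandas as pd

# Sample data standing in for old_df = psql.read_sql(sql, cnxn)
old_df = pd.DataFrame({
    "StartTime": pd.to_datetime(["2015-02-16 00:00:00",
                                 "2015-02-16 00:30:00",
                                 "2015-02-16 00:34:00"]),
    "StartLevel": [480.0, 480.0, 390.0],
})

# Frame holding every minute we want a value for
date_list = pd.date_range("2015-02-16 00:00:00", periods=60, freq="min")
new_df = pd.DataFrame({"StartTime": date_list, "fake_val": np.nan})

# Outer-merge on the timestamp, sort, drop the placeholder column,
# then interpolate the gaps in StartLevel
final_df = (new_df.merge(old_df, how="outer", on="StartTime")
                  .sort_values(by="StartTime")
                  .reset_index(drop=True)
                  .drop(columns="fake_val"))
final_df["StartLevel"] = final_df["StartLevel"].interpolate()
```

Note that with the default linear method, minutes after the last reading are carried forward at the last known value, which matches the behaviour described in the question.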

Obviously the fake_val column can be thrown out once you've merged in the good data. The purpose of creating that dataframe is to have one indexed with all the values you want (this is where I'm sure there is a more Pythonic answer).
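As one possibly tidier alternative (a sketch with made-up sample data, not tested against the real database), the fake frame can be skipped entirely by reindexing against a pd.date_range and interpolating by time:

```python
import pandas as pd

old_df = pd.DataFrame({
    "StartTime": pd.to_datetime(["2015-02-16 00:00:00",
                                 "2015-02-16 00:30:00",
                                 "2015-02-16 00:34:00"]),
    "StartLevel": [480.0, 480.0, 390.0],
})

# Every minute we want a value for
idx = pd.date_range("2015-02-16 00:00:00", periods=60, freq="min")

final = (old_df.set_index("StartTime")
               .reindex(idx)                 # NaN for the missing minutes
               .interpolate(method="time"))  # weight by elapsed time
```

method="time" weights the interpolation by the gaps in the DatetimeIndex, which matters if the original readings ever fall off the minute grid.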

Full documentation for interpolate can be found in the pandas DataFrame.interpolate reference.

sedavidw
  • Thanks for the suggestion but it doesn't quite work as it generates the list of times by minute, a column with NaN and then simply maps the original StartTimes onto this – Patrick A Feb 17 '15 at 19:12
  • @PatrickA Did you call `interpolate()`? It's possible you're not saving the result. Looking at the documentation, it does not automatically do that in place, so you can do something like `new_df = new_df.interpolate()` or `new_df.interpolate(inplace=True)`. I've edited the answer to reflect this – sedavidw Feb 17 '15 at 19:33
  • I believe so. It is generating the correct minute-by-minute time series and NaN in the fake_val columns. It also seems to be interpolating the last value out till the end of the minute-by-minute but not interpolating the StartLevel - I'd like to post the output for you on here but I'm not sure how to – Patrick A Feb 17 '15 at 20:03
  • @PatrickA Can you edit your question and put it there? If you could include the code you used that would be helpful too – sedavidw Feb 17 '15 at 20:11
  • Done - posted in the original question – Patrick A Feb 17 '15 at 20:46
  • @PatrickA Ah, I see the problem. I was lazy in reading my output and I apologize; I've just updated my answer, try that – sedavidw Feb 17 '15 at 21:04