0

Currently im working on a script to fill in missing values in a timeline. The dataframe is looking thusfar, values are examples:

timestamp   id  value
2016-01-01 01:00:00 1   10
2016-01-01 02:00:00 1   15
2016-01-01 03:00:00 1   12
2016-01-01 04:00:00 1   NAN
2016-01-01 05:00:00 1   NAN
2016-01-01 06:00:00 1   7
2016-01-01 07:00:00 1   9
2016-01-01 01:00:00 2   10
2016-01-01 02:00:00 2   12
2016-01-01 03:00:00 2   14
2016-01-01 04:00:00 2   16
2016-01-01 05:00:00 2   8
2016-01-01 06:00:00 2   11
2016-01-01 07:00:00 2   14
2016-01-01 01:00:00 3   10
2016-01-01 02:00:00 3   12
2016-01-01 03:00:00 3   14
2016-01-01 04:00:00 3   12
2016-01-01 05:00:00 3   8
2016-01-01 06:00:00 3   9
2016-01-01 07:00:00 3   12

What i would like is to fill in the NAN values by looking at the rise in values by the other object's. So for example if object id 2 and 3 rise 50% on average from 3:00:00 to 4:00:00 i can use the value 12 in this example and multiply it by 1.5. Im working in python at the moment using pandas.

I can run a function through the column 'value' checking if its NAN. But im sorta stuck how to go from there.

Basicly i want a function that when it finds a missing value to take the timestamp, then for that timestamp find the values the other objects. So in this example its missing data at 4:00:00, it will grab 16 of object 2 and 12 of object 3. Once it done that it will compare the found value to the value before this timestamp and divide by this. So 16/14 and 12/14. the average of this is 1, which i want multiplied by the value before the missing value. So 1.00 * 12.00. It has then filled in the timestamp at 4:00:00 and continue at 5:00:00

  • You are probably searching for Interpolation? https://pandas.pydata.org/pandas-docs/stable/missing_data.html#interpolation – MisterMonk Dec 05 '18 at 21:04
  • Hi Justin, what numbers would you expect for the two Nan that you have in your example? – Sander van den Oord Dec 05 '18 at 21:07
  • the values in here are made up, the values will respond the same to the change as the other objects are, so if object 2 goes up 50% in an hour i expect object 1 to do that aswell. I would like to take the average of all other objects who have data and then use that to estimate what the value would be – Justin van Dongen Dec 05 '18 at 21:12

1 Answers1

1

You may need to do a little munging to make sure your NaNs are seen correctly by pandas.

"""
timestamp,id,value
2016-01-01 01:00:00,1,10
2016-01-01 02:00:00,1,15
2016-01-01 03:00:00,1,12
2016-01-01 04:00:00,1,nan
2016-01-01 05:00:00,1,nan
2016-01-01 06:00:00,1,7
"""

import pandas as pd
import numpy as np

df = pd.read_clipboard(sep=",")
df.value = df.value.astype(float)
df.interpolate(method="linear")

returns:

0    10.000000
1    15.000000
2    12.000000
3    10.333333
4     8.666667
5     7.000000
Name: value, dtype: float64
Evan
  • 2,121
  • 14
  • 27
  • its imported alright i just made a quick draft of how the data looks, the gap is to big(nearly a year) to just interpolate for in this case. I want to estimate the values looking at similair measurement points – Justin van Dongen Dec 05 '18 at 21:10
  • OK. Can you post data that is more representative of your actual data, and/or clarify your question? For example, include the output you'd like to see. – Evan Dec 05 '18 at 21:12
  • 1
    You are missing a year? Thats an tough task to accomplish – MisterMonk Dec 05 '18 at 21:14
  • https://gyazo.com/d17904b4f69ce7cacabca47fd9a08b7a these are the values, the values of 1 object will respond the same to the change as the other objects, so if object 2 goes up 50% in an hour i expect object 1 to do that aswell. I would like to take the average of all other objects who have data and then use that to estimate what the value would be of the missing value – – Justin van Dongen Dec 05 '18 at 21:16