3

Let's say I have a time series where I usually have data available for a certain continous span of years, but missing values before and after that span, like this:

df = pd.DataFrame({'year': ["2000","2001","2002", "2003","2004", "2005","2006", "2007"], 'cakes eaten': [np.nan, np.nan, np.nan, 3, 4, 5, np.nan, np.nan]})
print(df)

   year  cakes eaten
0  2000          NaN
1  2001          NaN
2  2002          NaN
3  2003          3.0
4  2004          4.0
5  2005          5.0
6  2006          NaN
7  2007          NaN

Is there a way to fill (a given number of) missing values based on the trend seen in the available values?

Let's say I want to fill a maximum of 2 values in each direction, the result would have to look like this:

   year  cakes eaten
0  2000          NaN
1  2001          1.0
2  2002          2.0
3  2003          3.0
4  2004          4.0
5  2005          5.0
6  2006          6.0
7  2007          7.0

Also: is there a way to ensure that this imputation is only performed when there are enough available values , say for example I only want to fill a maximum of 2 values in each direction if there are at least 3 values available (or in more general terms, fill n only if n + m are availalbe) ?

Christian O.
  • 468
  • 2
  • 12

2 Answers2

1

I would use the mentioned interpolate(). There are various methods you can use that would produce different results. I used krogh method to get a linear trend line. limit_direction='both' is required to populate trend in both directions:

test_dict  = {'col': [np.nan, np.nan,np.nan, np.nan, np.nan, 4, 5, 6 ,np.nan]}
df = pd.DataFrame(test_dict)
df['trend'] = df['col'].interpolate(method='krogh', limit_direction='both')

    col trend
0   NaN -1.0
1   NaN 0.0
2   NaN 1.0
3   NaN 2.0
4   NaN 3.0
5   4.0 4.0
6   5.0 5.0
7   6.0 6.0
8   NaN 7.0

Once that is done you can remove the unneeded below 0 trend values.

olv1do
  • 96
  • 6
  • 1
    Thanks for re-pointing me to the interpolate function, it does indeed seem to be able to do what I want. Krogh works very well for the example I posted above, but produces some very strange values if the trend is not perfectly linear. However, I found that the `spline` method with `order = 2` works much better – Christian O. Dec 29 '20 at 09:10
1

Thanks to @olv1do for showing me that interpolate() does what I want.

Using interpolate and .first_valid_index and .last_valid_index allows to implement the desired behaviour:

#impute n values in both directions if at least m values are available
def interpolate(data, n, m):
  first_valid = data['cakes eaten'].first_valid_index()
  last_valid = data['cakes eaten'].last_valid_index()

  if(abs(first_valid - last_valid) + 1 >= m):
    data['imputed'] = data['cakes eaten'].interpolate(method='spline',order = 1, limit_direction='both', limit = n)
  return data

For the example from the question:

df = pd.DataFrame({'year': ["2000","2001","2002", "2003","2004", "2005","2006", "2007"], 'cakes eaten': [np.nan, np.nan, np.nan, 3, 4, 5, np.nan, np.nan]})
interpolate(df, 2,3)

year    cakes eaten     imputed
0   2000    NaN     NaN
1   2001    NaN     1.0
2   2002    NaN     2.0
3   2003    3.0     3.0
4   2004    4.0     4.0
5   2005    5.0     5.0
6   2006    NaN     6.0
7   2007    NaN     7.0

Does nothing if there are fewer then m values available:

df = pd.DataFrame({'year': ["2000","2001","2002", "2003","2004", "2005","2006", "2007"], 'cakes eaten': [np.nan, np.nan, np.nan, 3, 4,  np.nan, np.nan, np.nan]})
interpolate(df, 2,3)

    year    cakes eaten
0   2000    NaN
1   2001    NaN
2   2002    NaN
3   2003    3.0
4   2004    4.0
5   2005    NaN
6   2006    NaN
7   2007    NaN

Also, the spline method also works very well if the values are not as perfectly linear as in my example:

df = pd.DataFrame({'year': ["2000","2001","2002", "2003","2004", "2005","2006", "2007"], 'cakes eaten': [np.nan, np.nan, 1, 4, 2,  3, np.nan, np.nan]})
interpolate(df, 1,4)

    year    cakes eaten     imputed
0   2000    NaN     NaN
1   2001    NaN     1.381040
2   2002    1.0     1.000000
3   2003    4.0     4.000000
4   2004    2.0     2.000000
5   2005    3.0     3.000000
6   2006    NaN     3.433167
7   2007    NaN     NaN
Christian O.
  • 468
  • 2
  • 12