How to interpolate between values on a dataframe using Python and/or R

Question

I have a dataset that looks like this:

Which I imported into a pandas dataframe using pandas.read_csv with the Year and Country column as indexes. What I need to do is change the timestep from every 5 years to annually, and interpolate said values, and I really haven't got a clue as to how to do that. I am learning both R and python, so help in either language would be highly appreciated.

unutbu · Answer 1 · 2016-06-05T20:29:46.400

If you give your DataFrame a DatetimeIndex, then you can take advantage of the df.resample and df.interpolate('time') methods.

To make df.index a DatetimeIndex you might be tempted to use set_index('Year'). However, the Year by itself is not unique since it is repeated for each Country. In order to call resample we will need a unique index. So use df.pivot instead:

# convert integer years into `datetime64` values
In [441]: df['Year'] = (df['Year'].astype('i8')-1970).view('datetime64[Y]')
In [442]: df.pivot(index='Year', columns='Country')
Out[442]: 
                Avg1                      Avg2                
Country    Australia Austria Belgium Australia Austria Belgium
Year                                                          
1950-01-01         0       0       0         0       0       0
1955-01-01         1       1       1        10      10      10
1960-01-01         2       2       2        20      20      20
1965-01-01         3       3       3        30      30      30

You can then use df.resample('A').mean() to resample the data with yearly frequency. You can think of resample('A') as chopping up df into groups of 1-year intervals. resample returns a DatetimeIndexResampler object whose mean method aggregates the values in each group by taking the mean. Thus mean() returns a DataFrame with one row for every year. Since your original df has one datum every 5 years, most of the 1-year groups will be empty, so the mean returns NaNs for those years. If your data is consistently spaced at 5-year intervals, then instead of .mean() you could use .first() or .last() instead. They would all return the same result.

In [438]: df.resample('A').mean()
Out[438]: 
                Avg1                      Avg2                
Country    Australia Austria Belgium Australia Austria Belgium
Year                                                          
1950-12-31       0.0     0.0     0.0       0.0     0.0     0.0
1951-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1952-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1953-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1954-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1955-12-31       1.0     1.0     1.0      10.0    10.0    10.0
1956-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1957-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1958-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1959-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1960-12-31       2.0     2.0     2.0      20.0    20.0    20.0
1961-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1962-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1963-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1964-12-31       NaN     NaN     NaN       NaN     NaN     NaN
1965-12-31       3.0     3.0     3.0      30.0    30.0    30.0

And then df.interpolate(method='time') will linearly interpolate missing NaN values based on the nearest non-NaN values and their associated datetime index values.

import numpy as np
import pandas as pd

countries = 'Australia Austria Belgium'.split()
year = np.arange(1950, 1970, 5)
df = pd.DataFrame(
    {'Country': np.repeat(countries, len(year)),
     'Year': np.tile(year, len(countries)),
     'Avg1': np.tile(np.arange(len(year)), len(countries)),
     'Avg2': 10*np.tile(np.arange(len(year)), len(countries))})
df['Year'] = (df['Year'].astype('i8')-1970).view('datetime64[Y]')
df = df.pivot(index='Year', columns='Country')

df = df.resample('A').mean()
df = df.interpolate(method='time')

df = df.stack('Country')
df = df.reset_index()
df = df.sort_values(by=['Country', 'Year'])
print(df)

yields

         Year    Country      Avg1       Avg2
0  1950-12-31  Australia  0.000000   0.000000
3  1951-12-31  Australia  0.199890   1.998905
6  1952-12-31  Australia  0.400329   4.003286
9  1953-12-31  Australia  0.600219   6.002191
12 1954-12-31  Australia  0.800110   8.001095
15 1955-12-31  Australia  1.000000  10.000000
18 1956-12-31  Australia  1.200328  12.003284
21 1957-12-31  Australia  1.400109  14.001095
...

Beautiful. I understand what `df['Year'] = (df['Year'].astype('i8')-1970).view('datetime64[Y]')` is doing- but I can't find the codes `'i8'` or `'datetime64[Y]'` anywhere. Am I looking in the wrong places or did you read the source-code to come up with those? — michael_j_ward, Jun 04 '16 at 20:51
@michael_j_ward: My understanding of `datetime64` mainly comes from http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html plus a lot of fooling around. The docs mention (and the dtype name `datetime64` hints strongly) that the underlying data type are 8-byte ints. So in order to do numerical math on datetime64s it is sometimes necessary to use `astype('i8')` to convert the `datetime64` to its underlying integer value. The `Code` columns [displayed here](http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html#datetime-units) show the possible `datetime64[...]` dtypes. — unutbu, Jun 04 '16 at 21:15
@michael_j_ward A more readable alternative to `(df['Year'].astype('i8')-1970).view('datetime64[Y]')` is `pd.to_datetime(df['Year'], format='%Y')`. For large Series, however, it is significantly slower. — unutbu, Jun 04 '16 at 21:29
I don't think I quite understand the resample parte, specially the .mean() part. Would you care to elaborate? — Olovram, Jun 05 '16 at 20:00
@MartinValdez: I added a little explanation of what `df.resample('A').mean()` is doing. I'm not sure I explained it very well, however, so be sure to also [read the docs](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html). — unutbu, Jun 05 '16 at 20:32

score 1 · Answer 2 · answered Jun 04 '16 at 19:59

This is a tough one, but I think I have it.

Here's a example with a sample dataframe:

df = pd.DataFrame({'country': ['australia', 'australia', 'belgium','belgium'], 
                   'year': [1980, 1985, 1980, 1985],
                   'data1': [1,5, 10, 15],
                   'data2': [100,110, 150,160]})
df = df.set_index(['country','year'])
countries = set(df.index.get_level_values(0))
df = df.reindex([(country, year) for country in countries for year in range(1980,1986)])
df = df.interpolate()
df = df.reset_index()

For your specific data, assuming every country has data for every 5 years between 1950 and 2010 (inclusive) it would be

df = pd.read_csv('path_to_data')
df = df.set_index(['country','year'])
countries = set(df.index.get_level_values(0))
df = df.reindex([(country, year) for country in countries for year in range(1950,2011)])
df = df.interpolate()
df = df.reset_index()

Kind of a tough problem. Interested to see if someone has a better solution

Batman · Answer 3 · 2016-06-04T19:58:17.960

0

First, reindex the frame. Then use df.apply and Series.interpolate

Something like:

import pandas as pd

df = pd.read_csv(r'folder/file.txt')
rows = df.shape[0]
df.index = [x for x in range(0, 5*rows, 5)]
df = df.reindex(range(0, 5*rows))
df.apply(pandas.Series.interpolate)
df.apply(pd.Series.interpolate, inplace=True)

edited Jun 04 '16 at 19:58

answered Jun 04 '16 at 19:44

Batman

8,571
7
41
80

Very nice. Last two lines could just use the pandas data-frame methods, though `df = df.interpolate()` `df = df.ffill()` – michael_j_ward Jun 04 '16 at 20:08
Given that the year series jumps from 2010-1950 each time data for a country ends, when you interpolate the "discontinuity" gets smoothed (Sorry, english is not my native language) – Olovram Jun 05 '16 at 01:22

How to interpolate between values on a dataframe using Python and/or R

3 Answers3