3

I have a dataset that looks like this:

image of a dataset

Which I imported into a pandas dataframe using pandas.read_csv with the Year and Country column as indexes. What I need to do is change the timestep from every 5 years to annually, and interpolate said values, and I really haven't got a clue as to how to do that. I am learning both R and python, so help in either language would be highly appreciated.

Johan B
  • 890
  • 3
  • 23
  • 39
Olovram
  • 53
  • 1
  • 7

3 Answers3

7
  • If you give your DataFrame a DatetimeIndex, then you can take advantage of the df.resample and df.interpolate('time') methods.

  • To make df.index a DatetimeIndex you might be tempted to use set_index('Year'). However, the Year by itself is not unique since it is repeated for each Country. In order to call resample we will need a unique index. So use df.pivot instead:

    # convert integer years into `datetime64` values
    In [441]: df['Year'] = (df['Year'].astype('i8')-1970).view('datetime64[Y]')
    In [442]: df.pivot(index='Year', columns='Country')
    Out[442]: 
                    Avg1                      Avg2                
    Country    Australia Austria Belgium Australia Austria Belgium
    Year                                                          
    1950-01-01         0       0       0         0       0       0
    1955-01-01         1       1       1        10      10      10
    1960-01-01         2       2       2        20      20      20
    1965-01-01         3       3       3        30      30      30
    
  • You can then use df.resample('A').mean() to resample the data with yearly frequency. You can think of resample('A') as chopping up df into groups of 1-year intervals. resample returns a DatetimeIndexResampler object whose mean method aggregates the values in each group by taking the mean. Thus mean() returns a DataFrame with one row for every year. Since your original df has one datum every 5 years, most of the 1-year groups will be empty, so the mean returns NaNs for those years. If your data is consistently spaced at 5-year intervals, then instead of .mean() you could use .first() or .last() instead. They would all return the same result.

    In [438]: df.resample('A').mean()
    Out[438]: 
                    Avg1                      Avg2                
    Country    Australia Austria Belgium Australia Austria Belgium
    Year                                                          
    1950-12-31       0.0     0.0     0.0       0.0     0.0     0.0
    1951-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1952-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1953-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1954-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1955-12-31       1.0     1.0     1.0      10.0    10.0    10.0
    1956-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1957-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1958-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1959-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1960-12-31       2.0     2.0     2.0      20.0    20.0    20.0
    1961-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1962-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1963-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1964-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1965-12-31       3.0     3.0     3.0      30.0    30.0    30.0
    
  • And then df.interpolate(method='time') will linearly interpolate missing NaN values based on the nearest non-NaN values and their associated datetime index values.


import numpy as np
import pandas as pd

countries = 'Australia Austria Belgium'.split()
year = np.arange(1950, 1970, 5)
df = pd.DataFrame(
    {'Country': np.repeat(countries, len(year)),
     'Year': np.tile(year, len(countries)),
     'Avg1': np.tile(np.arange(len(year)), len(countries)),
     'Avg2': 10*np.tile(np.arange(len(year)), len(countries))})
df['Year'] = (df['Year'].astype('i8')-1970).view('datetime64[Y]')
df = df.pivot(index='Year', columns='Country')

df = df.resample('A').mean()
df = df.interpolate(method='time')

df = df.stack('Country')
df = df.reset_index()
df = df.sort_values(by=['Country', 'Year'])
print(df)

yields

         Year    Country      Avg1       Avg2
0  1950-12-31  Australia  0.000000   0.000000
3  1951-12-31  Australia  0.199890   1.998905
6  1952-12-31  Australia  0.400329   4.003286
9  1953-12-31  Australia  0.600219   6.002191
12 1954-12-31  Australia  0.800110   8.001095
15 1955-12-31  Australia  1.000000  10.000000
18 1956-12-31  Australia  1.200328  12.003284
21 1957-12-31  Australia  1.400109  14.001095
...
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • Beautiful. I understand what `df['Year'] = (df['Year'].astype('i8')-1970).view('datetime64[Y]')` is doing- but I can't find the codes `'i8'` or `'datetime64[Y]'` anywhere. Am I looking in the wrong places or did you read the source-code to come up with those? – michael_j_ward Jun 04 '16 at 20:51
  • 2
    @michael_j_ward: My understanding of `datetime64` mainly comes from http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html plus a lot of fooling around. The docs mention (and the dtype name `datetime64` hints strongly) that the underlying data type are 8-byte ints. So in order to do numerical math on datetime64s it is sometimes necessary to use `astype('i8')` to convert the `datetime64` to its underlying integer value. The `Code` columns [displayed here](http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html#datetime-units) show the possible `datetime64[...]` dtypes. – unutbu Jun 04 '16 at 21:15
  • 1
    @michael_j_ward A more readable alternative to `(df['Year'].astype('i8')-1970).view('datetime64[Y]')` is `pd.to_datetime(df['Year'], format='%Y')`. For large Series, however, it is significantly slower. – unutbu Jun 04 '16 at 21:29
  • I don't think I quite understand the resample parte, specially the .mean() part. Would you care to elaborate? – Olovram Jun 05 '16 at 20:00
  • @MartinValdez: I added a little explanation of what `df.resample('A').mean()` is doing. I'm not sure I explained it very well, however, so be sure to also [read the docs](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html). – unutbu Jun 05 '16 at 20:32
1

This is a tough one, but I think I have it.

Here's a example with a sample dataframe:

df = pd.DataFrame({'country': ['australia', 'australia', 'belgium','belgium'], 
                   'year': [1980, 1985, 1980, 1985],
                   'data1': [1,5, 10, 15],
                   'data2': [100,110, 150,160]})
df = df.set_index(['country','year'])
countries = set(df.index.get_level_values(0))
df = df.reindex([(country, year) for country in countries for year in range(1980,1986)])
df = df.interpolate()
df = df.reset_index()

For your specific data, assuming every country has data for every 5 years between 1950 and 2010 (inclusive) it would be

df = pd.read_csv('path_to_data')
df = df.set_index(['country','year'])
countries = set(df.index.get_level_values(0))
df = df.reindex([(country, year) for country in countries for year in range(1950,2011)])
df = df.interpolate()
df = df.reset_index()

Kind of a tough problem. Interested to see if someone has a better solution

michael_j_ward
  • 4,369
  • 1
  • 24
  • 25
0

First, reindex the frame. Then use df.apply and Series.interpolate

Something like:

import pandas as pd

df = pd.read_csv(r'folder/file.txt')
rows = df.shape[0]
df.index = [x for x in range(0, 5*rows, 5)]
df = df.reindex(range(0, 5*rows))
df.apply(pandas.Series.interpolate)
df.apply(pd.Series.interpolate, inplace=True)
Batman
  • 8,571
  • 7
  • 41
  • 80
  • Very nice. Last two lines could just use the pandas data-frame methods, though `df = df.interpolate()` `df = df.ffill()` – michael_j_ward Jun 04 '16 at 20:08
  • Given that the year series jumps from 2010-1950 each time data for a country ends, when you interpolate the "discontinuity" gets smoothed (Sorry, english is not my native language) – Olovram Jun 05 '16 at 01:22