0

I have a large csv file example below,

data = pd.read_csv('C:/Users/Ene_E/Desktop/Data/data.csv')
data.head()

          name  year  value
0  Afghanistan  1800    603
1      Albania  1800    667
2      Algeria  1800    715
3      Andorra  1800   1200
4       Angola  1800    618

data.tail()

            name  year  value
46508  Venezuela  2040  17600
46509    Vietnam  2040  12300
46510      Yemen  2040   3960
46511     Zambia  2040   6590
46512   Zimbabwe  2040   3210

my large CSV has over 200 countries and their data recorded annually from 1800 - 2040, my goal is to resample this data to monthly and interpolate value column as shown below, I have used Afghanistan, the year 1800 to illustrate how my desired end result,

Expected output:

name               date    value
Afghanistan        Jan     1800   start_value
Afghanistan        Feb     1800   .
Afghanistan        Mar     1800   . 
Afghanistan        May     1800   .
Afghanistan        Jun     1800   .
Afghanistan        Jul     1800   .This column is interpolated smoothly
Afghanistan        Aug     1800   .
Afghanistan        Sep     1800   .
Afghanistan        Oct     1800   .
Afghanistan        Nov     1800   .
Afghanistan        Dec     1800   603(end value in that year)

I want all my data resampled as in the above in python because this only way my model will work. NB: the dates should take above format.

I have tried severally without success,

data['year'] = pd.to_datetime(data.year, format='%Y')
head(data)

Error:

Traceback (most recent call last):   File "<pyshell#12>", line 1, in <module>
    head(data) NameError: name 'head' is not defined

data.head()

          name                year  value
0  Afghanistan 1800-01-01 00:00:00    603
1      Albania 1800-01-01 00:00:00    667
2      Algeria 1800-01-01 00:00:00    715
3      Andorra 1800-01-01 00:00:00   1200
4       Angola 1800-01-01 00:00:00    618

data.resample('1M', how='interpolate')

Error:

Traceback (most recent call last):
  File "<pyshell#14>", line 1, in <module>
    data.resample('1M', how='interpolate')
  File "C:\Python27\lib\site-packages\pandas\core\generic.py", line 8145, in resample
    base=base, key=on, level=level)
  File "C:\Python27\lib\site-packages\pandas\core\resample.py", line 1251, in resample
    return tg._get_resampler(obj, kind=kind)
  File "C:\Python27\lib\site-packages\pandas\core\resample.py", line 1381, in _get_resampler
    "but got an instance of %r" % type(ax).__name__)
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'

data.groupby(name).resample('1M', how='interpolate')

Error:

Traceback (most recent call last):
  File "<pyshell#15>", line 1, in <module>
    data.groupby(name).resample('1M', how='interpolate')
NameError: name 'name' is not defined

Thoughts?

Pygirl
  • 12,969
  • 5
  • 30
  • 43
  • https://stackoverflow.com/questions/34693079/python-pandas-dataframe-interpolate-missing-data <-- does this helps – Pygirl Mar 07 '20 at 08:55
  • data.groupby('name').resample('1M', how='interpolate') not name it should be 'name' – Pygirl Mar 07 '20 at 08:55
  • Do all your readings start in Jan and end in Dec or you want the minimum starting date labelled as Start and max as End? – wwnde Mar 07 '20 at 14:09

2 Answers2

0

Conditionally assign name for where date values are missing using np.where

data['value']=np.where(data['date'].isna(), 'This column is interpolated smoothly', '')#.data.ffill(axis=0, inplace=True)

Forward fill missing dates

data['date']=pd.to_datetime(data['date']).ffill()

Group by date and reset back to dataframe

data.set_index('date', inplace=True)
data['value'] = np.where( data.index.month== 1, 'start_value', data['value']) 
data['value'] = np.where( data.index.month== 12, 'End_value', data['value'])
data.groupby(data.index.month)['name', 'value'].ffill().reset_index().sort_values(by=['name','date'], ascending=True).drop_duplicates()
wwnde
  • 26,119
  • 6
  • 18
  • 32
0

@DEVELOPER_ONE I am not familiar with interpolate or resample, but I felt like giving it a go a different way. I literally produced similar to what your desired output is:

import pandas as pd
import numpy as np
data = pd.DataFrame({'name':['Afghanistan', 'Albania', 'Zimbabwe','Afghanistan', 
                             'Albania', 'Zimbabwe'],
                     'year':[1800,1800,1800,2040,2040,2040],
                     'value' : [603,667,59,2415,2804,3210]
                     })
df_year_unique = pd.DataFrame(data['year'].drop_duplicates().reset_index(drop=True))
df_name_unique = pd.DataFrame(data['name'].drop_duplicates().reset_index(drop=True))
df_month_unique = pd.DataFrame({'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
                                          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']})
df_name = pd.DataFrame(pd.concat([df_name_unique]*len(
df_month_unique)*len(df_year_unique),
ignore_index=True)).sort_values('name').reset_index(drop=True)
df_month = pd.DataFrame(pd.concat([df_month_unique]*len(
df_year_unique)*len(df_name_unique),
ignore_index=True))
df_year = pd.DataFrame(pd.concat([df_year_unique]*len(
df_month_unique)*len(df_name_unique),                          
ignore_index=True)).sort_values('year').reset_index(drop=True)
df_year_month = pd.merge(df_month, df_year, how='inner', left_index=True, 
right_index=True)
df_year_month_name = pd.merge(df_year_month, df_name, how='inner', left_index=True, 
right_index=True)
df = pd.merge(df_year_month_name, data, how='left', on=['name','year'])
df['value'] = np.where(df['Month'] != 'Dec', '.', df['value'])
df['value'] = np.where(df['Month'] == 'Jan', 'start_value', df['value'])
df['value'] = np.where(df['Month'] == 'Jul', '.This column is interpolated smoothly', 
df['value'])
df
David Erickson
  • 16,433
  • 2
  • 19
  • 35