0

I would like to resample df by creating monthly data for all columns and filling in missing values with 0, within the time frame of say 2019-01-01 to 2019-12-31.

df:

    ITEM_ID Date        Value YearMonth
0   101002  2019-03-31  1.0   2019-03
1   101002  2019-04-30  1.0   2019-04
2   101002  2019-10-31  0.0   2019-10
3   101002  2019-11-30  8.0   2019-11
4   101002  2019-12-31  5.0   2019-12

Expected output:

    ITEM_ID Date        Value YearMonth
         ...            0     2019-01 (added)
         ...            0     2019-02 (added)
0   101002  2019-03-31  1.0   2019-03
1   101002  2019-04-30  1.0   2019-04
         ...            0     2019-05 (added)
         ...            0     2019-06 (added)
         ...            0     2019-07 (added)
         ...            0     2019-08 (added)
         ...            0     2019-09 (added)
2   101002  2019-10-31  0.0   2019-10
3   101002  2019-11-30  8.0   2019-11
4   101002  2019-12-31  5.0   2019-12

I came across a few methods like multiindex and resample. multiindex seems to be versatile but gets a bit complicated when it involves different levels of indexes; I am not sure if resample allows me to extend the effect to specified time frame. What is the best way to do it?

nilsinelabore
  • 4,143
  • 17
  • 65
  • 122

3 Answers3

1

Here is the solution

import pandas as pd

df1= # this is the dataframe which you have given example. please change accordingly.
print(df1)

data=[['2019-01'],['2019-02'],['2019-03'],['2019-04'],['2019-05'],['2019-06'],['2019-07'],['2019-08'],
['2019-09'],['2019-10'],['2019-11'],['2019-12']]
df2=pd.DataFrame(data=data,columns=['YearMonth'])
print(df2)

final_DF = pd.merge(df1,df2,on ='YearMonth',how ='outer').sort_values('YearMonth')
final_DF = final_DF.fillna(0)
print(final_DF)
Vivs
  • 447
  • 4
  • 11
1

Instead of thinking in terms of year and month columns, we created an empty data frame with a start and end date and time and combined it with the original data frame.

df['Date'] = pd.to_datetime(df['Date'])
df1 = pd.DataFrame(index=pd.to_datetime(pd.date_range('2019-01-01', '2020-01-01', freq='1M'))).reset_index()
df1 = df1.merge(df, left_on='index', right_on='Date', how='outer')
df1['yearmonth'] = df1['index'].apply(lambda x: str(x.year) + '-' + '{:02}'.format(x.month))
df1
    index   ITEM_ID Date    Value   YearMonth   yearmonth
0   2019-01-31  NaN NaT NaN NaN 2019-01
1   2019-02-28  NaN NaT NaN NaN 2019-02
2   2019-03-31  101002.0    2019-03-31  1.0 2019-03 2019-03
3   2019-04-30  101002.0    2019-04-30  1.0 2019-04 2019-04
4   2019-05-31  NaN NaT NaN NaN 2019-05
5   2019-06-30  NaN NaT NaN NaN 2019-06
6   2019-07-31  NaN NaT NaN NaN 2019-07
7   2019-08-31  NaN NaT NaN NaN 2019-08
8   2019-09-30  NaN NaT NaN NaN 2019-09
9   2019-10-31  101002.0    2019-10-31  0.0 2019-10 2019-10
10  2019-11-30  101002.0    2019-11-30  8.0 2019-11 2019-11
11  2019-12-31  101002.0    2019-12-31  5.0 2019-12 2019-12
r-beginners
  • 31,170
  • 3
  • 14
  • 32
1

I think you need DataFrame.reindex:

df['YearMonth'] = pd.to_datetime(df['YearMonth'])
r = pd.to_datetime(pd.date_range('2019-01-01', '2020-01-01', freq='1MS'))
mux = pd.MultiIndex.from_product([df['ITEM_ID'].unique(), r], names=['ITEM_ID','YearMonth'])

df = df.set_index(['ITEM_ID','YearMonth']).reindex(mux).fillna({'Value':0}).reset_index().reindex(df.columns, axis=1)

print (df)
    ITEM_ID        Date  Value  YearMonth
0    101002         NaN    0.0 2019-01-01
1    101002         NaN    0.0 2019-02-01
2    101002  2019-03-31    1.0 2019-03-01
3    101002  2019-04-30    1.0 2019-04-01
4    101002         NaN    0.0 2019-05-01
5    101002         NaN    0.0 2019-06-01
6    101002         NaN    0.0 2019-07-01
7    101002         NaN    0.0 2019-08-01
8    101002         NaN    0.0 2019-09-01
9    101002  2019-10-31    0.0 2019-10-01
10   101002  2019-11-30    8.0 2019-11-01
11   101002  2019-12-31    5.0 2019-12-01
12   101002         NaN    0.0 2020-01-01
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252