0

My current data has variables recorded at different time interval and I want to have all variables cleaned and nicely aligned in a weekly format by either redistribution (weekly = monthly/4) or fill in the monthly value for each week (weekly = monthly).

     df=pd.DataFrame({
                     'Date':['2020-06-03','2020-06-08','2020-06-15','2020-06-22','2020-06-29','2020-07-15','2020-08-15','2020-09-15','2020-10-14','2020-11-15','2020-12-15','2020-12-31','2021-01-15'],
                     'Date_Type':['Week_start_Mon','Week_start_Mon','Week_start_Mon','Week_start_Mon','Week_start_Mon','Monthly','Monthly','Monthly','Monthly','Monthly','Annual','Annual','Annual'],
                     'Var_Name':['A','A','A','A','B','C','C','C','E','F','G','G','H'],
                     'Var_Value':

[150,50,0,200,800,5000,2000,6000.15000,2300,3300,650000,980000,1240000]})





    Date    Date_Type   Var_Name    Var_Value
0   2020-06-03  Week_start_Mon  A   150.0
1   2020-06-08  Week_start_Mon  A   50.0
2   2020-06-15  Week_start_Mon  A   0.0
3   2020-06-22  Week_start_Mon  A   200.0
4   2020-06-29  Week_start_Mon  B   800.0
5   2020-07-15  Monthly C   5000.0
6   2020-08-15  Monthly C   2000.0
7   2020-09-15  Monthly C   6000.15
8   2020-10-14  Monthly E   2300.0
9   2020-11-15  Monthly F   3300.0
10  2020-12-15  Annual  G   650000.0
11  2020-12-31  Annual  G   980000.0
12  2021-01-15  Annual  H   1240000.0

An ideal output will look like this: For variable C, the date range will be the start to the end dates of master df. All dates are aligned and set to start on Mondays of that week. The monthly variable value is evenly distributed to 4 weeks, and there would 0 for each week in June.

Similarly annual variables will be distributed to 52 weeks.

  Date  Date_Type   Var_Name    Var_Value
    0   2020-06-01  Monthly C   0
    1   2020-06-08  Monthly C   0
    2   2020-06-15  Monthly C   0
    3   2020-06-22  Monthly C   0
    4   2020-06-29  Monthly C   0
    5   2020-07-06  Monthly C   1250
    6   2020-07-13  Monthly C   1250
    7   2020-07-20  Monthly C   1250
    8   2020-07-27  Monthly C   1250
    9   2020-08-03  Monthly C   400
   10   2020-08-10  Monthly C   400
   11   2020-08-17  Monthly C   400
   12   2020-08-24  Monthly C   400
   13   2020-08-31  Monthly C   400
   . 
   . 
   . 
   to the end date

For variable E, a percentage value that need to be filled for every week where it applies, the output would look like this:

  Date  Date_Type   Var_Name    Var_Value
    0   2020-06-01  Monthly E   0
    1   2020-06-08  Monthly E   0
    2   2020-06-15  Monthly E   0
    3   2020-06-22  Monthly E   0
    .
    .
    .
    5   2020-09-28  Monthly E   0
    6   2020-10-05  Monthly E   0.35
    7   2020-10-12  Monthly E   0.35
    8   2020-10-19  Monthly E   0.35
    9   2020-10-26  Monthly E   0.35
   10   2020-11-02  Monthly E   0
   11   2020-11-09  Monthly E   0
   12   2020-11-16  Monthly E   0

Ultimately my goal is to create a loop for treating this kind of data

if weekly
     xxxxx
if monthly
     xxxxx
if annual
     xxxxx

Please help!

riostrich
  • 1
  • 1
  • Can you write expected output for `3 2020-07-15 Monthly C 5000.0`, please? – Corralien Apr 23 '21 at 03:31
  • Hello thank you for reaching out! I updated my expected output in the questions and hope it's clear! Please let me know if there is any other questions. – riostrich Apr 23 '21 at 06:07
  • How do you determine the start date of `(Monthly, C)` (2020-06-06) ? Why june and why saturday? The end date is ...? – Corralien Apr 23 '21 at 07:22
  • That was my bad and I edited the questions. For each variable, the start date and end date are the same as the entire df. In the example, the earlier row of data is on wednesday june 3. In the output, the first date would be set to monday of that week, which is june 1. – riostrich Apr 23 '21 at 15:26
  • Too many inconsistencies: "The monthly variable value is evenly distributed to 4 weeks" but the value of variable C is spread over 5 weeks. The percentage value 0.35 disappeared from your initial data. Please edit your post from beginning and make it consistent. – Corralien Apr 23 '21 at 18:10

1 Answers1

0

This is a partial answer, I need some explanation.

  1. Set Date as index and realign all dates to Monday (I assume Date is already a datetime64 dtype)
df = df.set_index("Date")
df.index = df.index.map(lambda d: d - pd.tseries.offsets.Day(d.weekday()))
>>> df
           Date_Type Var_Name   Var_Value
Date
2020-06-01    Weekly        A      150.00
2020-06-08    Weekly        A       50.00
2020-06-15    Weekly        A        0.00
2020-06-22    Weekly        A      200.00
2020-06-29    Weekly        B      800.00
2020-07-13   Monthly        C     5000.00
2020-08-10   Monthly        C     2000.00
2020-09-14   Monthly        C     6000.15
2020-10-12   Monthly        E     2300.00
2020-11-09   Monthly        F     3300.00
2020-12-14    Annual        G   650000.00
2020-12-28    Annual        G   980000.00
2021-01-11    Annual        H  1240000.00
  1. Create the index for each variable from 2020-06-01 to 2021-01-11 with a frequency of 7 days:
dti = pd.date_range(df.index.min(), df.index.max(), freq="7D", name="Date")
>>> dti
DatetimeIndex(['2020-06-01', '2020-06-08', '2020-06-15', '2020-06-22',
               '2020-06-29', '2020-07-06', '2020-07-13', '2020-07-20',
               '2020-07-27', '2020-08-03', '2020-08-10', '2020-08-17',
               '2020-08-24', '2020-08-31', '2020-09-07', '2020-09-14',
               '2020-09-21', '2020-09-28', '2020-10-05', '2020-10-12',
               '2020-10-19', '2020-10-26', '2020-11-02', '2020-11-09',
               '2020-11-16', '2020-11-23', '2020-11-30', '2020-12-07',
               '2020-12-14', '2020-12-21', '2020-12-28', '2021-01-04',
               '2021-01-11'],
              dtype='datetime64[ns]', name='Date', freq='7D')
  1. Reindex your dataframe with the new index (pivot for a better display):
df = df.pivot(columns=["Date_Type", "Var_Name"], values="Var_Value").reindex(dti)
>>> df
Date_Type  Weekly         Monthly                    Annual
Var_Name        A      B        C       E       F         G          H
Date
2020-06-01  150.0    NaN      NaN     NaN     NaN       NaN        NaN
2020-06-08   50.0    NaN      NaN     NaN     NaN       NaN        NaN
2020-06-15    0.0    NaN      NaN     NaN     NaN       NaN        NaN
2020-06-22  200.0    NaN      NaN     NaN     NaN       NaN        NaN
2020-06-29    NaN  800.0      NaN     NaN     NaN       NaN        NaN
2020-07-06    NaN    NaN      NaN     NaN     NaN       NaN        NaN
2020-07-13    NaN    NaN  5000.00     NaN     NaN       NaN        NaN
2020-07-20    NaN    NaN      NaN     NaN     NaN       NaN        NaN
2020-07-27    NaN    NaN      NaN     NaN     NaN       NaN        NaN
2020-08-03    NaN    NaN      NaN     NaN     NaN       NaN        NaN
2020-08-10    NaN    NaN  2000.00     NaN     NaN       NaN        NaN
2020-08-17    NaN    NaN      NaN     NaN     NaN       NaN        NaN
2020-08-24    NaN    NaN      NaN     NaN     NaN       NaN        NaN
2020-08-31    NaN    NaN      NaN     NaN     NaN       NaN        NaN
2020-09-07    NaN    NaN      NaN     NaN     NaN       NaN        NaN
2020-09-14    NaN    NaN  6000.15     NaN     NaN       NaN        NaN
2020-09-21    NaN    NaN      NaN     NaN     NaN       NaN        NaN
2020-09-28    NaN    NaN      NaN     NaN     NaN       NaN        NaN
2020-10-05    NaN    NaN      NaN     NaN     NaN       NaN        NaN
2020-10-12    NaN    NaN      NaN  2300.0     NaN       NaN        NaN
2020-10-19    NaN    NaN      NaN     NaN     NaN       NaN        NaN
2020-10-26    NaN    NaN      NaN     NaN     NaN       NaN        NaN
2020-11-02    NaN    NaN      NaN     NaN     NaN       NaN        NaN
2020-11-09    NaN    NaN      NaN     NaN  3300.0       NaN        NaN
2020-11-16    NaN    NaN      NaN     NaN     NaN       NaN        NaN
2020-11-23    NaN    NaN      NaN     NaN     NaN       NaN        NaN
2020-11-30    NaN    NaN      NaN     NaN     NaN       NaN        NaN
2020-12-07    NaN    NaN      NaN     NaN     NaN       NaN        NaN
2020-12-14    NaN    NaN      NaN     NaN     NaN  650000.0        NaN
2020-12-21    NaN    NaN      NaN     NaN     NaN       NaN        NaN
2020-12-28    NaN    NaN      NaN     NaN     NaN  980000.0        NaN
2021-01-04    NaN    NaN      NaN     NaN     NaN       NaN        NaN
2021-01-11    NaN    NaN      NaN     NaN     NaN       NaN  1240000.0

It only remains to fill in the missing values. It can be easy if I know how to deal with:

if weekly
     xxxxx
if monthly
     xxxxx
if annual
     xxxxx
Corralien
  • 109,409
  • 8
  • 28
  • 52