2

The original DF has only the index (Datetime[ns]) and the objective variable.

I have first added a few columns extracting information from the 'dates' index

df['dayofweek'] = df['date'].dt.dayofweek
df['quarter'] = df['date'].dt.quarter
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
df['dayofyear'] = df['date'].dt.dayofyear
df['dayofmonth'] = df['date'].dt.day
df['weekofyear'] = df['date'].dt.weekofyear
df['weekday'] = ''
df.loc[df.dayofweek <= 4, 'weekday'] = 1
df.loc[df.dayofweek > 4, 'weekday'] = 0
df['weekend'] = ''
df.loc[df.dayofweek <= 4, 'weekend'] = 0
df.loc[df.dayofweek > 4, 'weekend'] = 1

Then, the function which calculates inefficiently the average of the (week, weekend, weekdays, month or year) from any (days, weeks, months or years) ago for every value in the time series.

The problem is that as I loop over the whole DataFrame this is very inefficient.

I thought about using vectorization but I can't find the solution

FUNCTION:

def operator_past_X_from_n_Y_ago ( df, y , fcn , period_type1 , days_ago , weeks_ago=0 , months_ago=0 , years_ago=0 ):
    """
    This functions takes a pd:DataFrame as input (with dates as intex) and creates a column calculating the average
    or some of the other operators (fcn) over the objective variable (y) within a past period of time like: week, weekdays,
    weekend ,month or year from any days, weeks, months or years ago as wanted.
    
    ## EXAMPLES ##
    
    Calculate the average of the past month from 7 days ago.
    operator_past_X_from_n_Y_ago ( df, y=objective , fcn='mean' , period_type1='m' , days_ago=7)
    
    Calculate the standard deviation of last weekend from 2 weeks and 3 days ago.
    operator_past_X_from_n_Y_ago ( df, y=objective , fcn='std' , period_type1='weekend' , days_ago=7, weeks_ago=2)
    
    
    ## POSSIBLE ARGUMENTS ##
    
    df = (Pandas.DataFrame) : Dataset containing the objective variable and dates as index.
    y = (String) : Name of the objective variable as it appearst in the df column.
    fcn = (String) : 'avg' for average, 'std' for standard deviation.
    period_type1 = (String) : Period of time over the one that you want to aply the (fcn). The options are:
    'd','M','w','weekdays','weekend','y'.
    days_ago , weeks_ago , months_ago , years_ago  = (Int) : The moment of time from the one that you want to know
    the (fcn) over the past (period_type).
    
    ## LIBS NEEDED ##
    
    import numpy as np
    import pandas as pd
    
    """

    fcn_dict = {'avg': np.mean , 'std': np.std }
    join_dict = {'w': 'weekofyear' , 'M': 'month' }
    skip_dict = {'d':1,'w':7,'weekend':7,'weekdays':7,'M':31,'y':365}



    skip =   skip_dict[period_type1] + days_ago + weeks_ago*7 + months_ago*31 + years_ago*366

    
    ## FEATURE NAME ## 
    feature_name = fcn + '_past_' + period_type1 + '_from_'
    
    if days_ago > 0  :
        feature_name = feature_name + str(days_ago) + 'days_' 
    if weeks_ago > 0  :
        feature_name = feature_name + str(weeks_ago) + 'weeks_'
    if months_ago > 0  :
        feature_name = feature_name + str(months_ago) + 'months_'        
    if years_ago > 0  :
        feature_name = feature_name + str(years_ago) + 'years_'        
        
    feature_name = feature_name + 'ago'
    
    df[feature_name] = [np.nan for i in range(len(df[objetivo])) ] #Creates nan column named 'feature_name'

    if period_type1 in ['d','w','M','y']:
        for i in df.index[skip:]:

            j = i - dateutil.relativedelta.relativedelta(days=days_ago , weeks=weeks_ago , months=months_ago , years=years_ago)
            df_last_period_type1 = df.loc[(df[join_dict[period_type1]] == df[join_dict[period_type1]][j]) & (df['year'] == df['year'][j])][y]
            df[feature_name][i] = fcn_dict[fcn](df_last_period_type1)

    if period_type1 == 'weekdays':
        for i in df.index[skip:]:

            j = i - dateutil.relativedelta.relativedelta(days=days_ago , weeks=weeks_ago , months=months_ago , years=years_ago)
            df_last_period_type1 = df.loc[(df['weekofyear'] == df['weekofyear'][j]) & (df['year'] == df['year'][j]) & (df['dayofweek'] < 4 )][y]
            df[feature_name][i] = fcn_dict[fcn](df_last_period_type1)

    if period_type1 == 'weekend':
        for i in df.index[skip:]:

            j = i - dateutil.relativedelta.relativedelta(days=days_ago , weeks=weeks_ago , months=months_ago , years=years_ago)
            df_last_period_type1 = df.loc[(df['weekofyear'] == df['weekofyear'][j]) & (df['year'] == df['year'][j]) & (df['dayofweek'] > 4)][y]
            df[feature_name][i] = fcn_dict[fcn](df_last_period_type1)
    
    return df
Mr. T
  • 11,960
  • 10
  • 32
  • 54
Javi
  • 33
  • 4
  • Can you please add a small sample input as text and the expected (=current?) output? Besides: If you code `df['weekday'] = 0; df.loc[df.dayofweek <= 4, 'weekday'] = 1`, then the rest of your definition is not required. Column `weekend` is just the inverted weekday column. – Mr. T Feb 23 '22 at 10:46
  • Javi, I suggest we start from one setting. I mean you choose a particular value for each of the `fcn , period_type1 , days_ago`, Then you update your question with the setting, and with a sample input data, and with an expected output. – Raymond Kwok Feb 23 '22 at 10:52

0 Answers0