2

I have the following dataframe which I am trying to setup to use in some regression analysis:

Date        Person     Feature1  Feature2 ....  Feature100
1/1/2020      Jim         12        15               82
1/7/2020      Jim         1         25               84
1/1/2021      Jim         12        15               85
1/1/2020      Jan         1         35               86
1/7/2020      Jan         5         15               84
1/1/2021      Jan         14        5               82

I have created a list of some of the columns I would like to transform (about 50 columns):

l = ['Feature1','Feature2',......'Feature58']

For each of columns names found in l I would like to create a weighted average of each persons last 20 entries (weighted to the most recent) and shifted by 1 (because I hope to use it as a predicition feature).

Date        Person     Feature1  Feature2 ....  Feature100   Feature1_Shifted    Feature2_Shifted ... Feature58_Shifted
1/1/2020      Jim         12        15               82        N/A               N/A
1/7/2020      Jim         1         25               84        12               15
1/1/2021      Jim         12        15               85        6.5              20
1/1/2020      Jan         1         35               86        N/A              N/A
1/7/2020      Jan         5         15               84        1                35
1/1/2021      Jan         14        5               82         3                25

Loosly based off this quesition here: Most Pythonic Way to Create Many New Columns in Pandas

and for the weighted average here: https://stackoverflow.com/a/57602282/13194245 with the option of changing the weightings.

I am struggling to combine everything into one (and where to start). So any help would be greatly appreciated. Thanks!

SOK
  • 1,732
  • 2
  • 15
  • 33
  • The answer on calculating weighted average you linked to uses a numpy function that requires specific weights specified for each value. Pandas has another function for calculating exponential (decaying) weighted averages: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ewm.html. Is that what you are hoping to implement? Or will you have weights defined in their own column? – AlexK Apr 05 '21 at 04:50
  • That looks like it could be a good option! I havent defined any weights in their own columns so that sounds like it will work! – SOK Apr 05 '21 at 04:55
  • Hi @AlexK. Do you know if you can set a window period with `ewm`? it doesnt look like you can? – SOK Apr 05 '21 at 05:19
  • 1
    See if the link I provided helps you – AlexK Apr 05 '21 at 05:23

1 Answers1

1

If the exponential decay approach is acceptable to you, there is actually already a post that should help you: Create a rolling custom EWMA on a pandas dataframe. The second (not accepted) answer implements a custom function to allow rolling EWM operations over a specified window.

To get it to work with your list of columns, you would just replace the last line of that code with:

l.append('Person')
ewm_df = df[l].groupby('Person').rolling(window).apply(ewma).reset_index(level=1, drop=True)

where l is your list of columns. If you want to add '_shifted' to each column name, rename should help as a last step:

ewm_df.rename({col: col + '_shifted' for col in l}, axis=1, inplace=True)

The window will be 20 in your case.

EDIT

I am not yet sure how to add the results to the original df using the code already provided. You would need a merge key besides "Person" in the ewm_df dataframe and then merge the two dfs on two keys. My workaround would be to create new columns in the original dataframe one by one:

for col in l:
  df[f"{col}_shifted"] = df.groupby('Person')[col].rolling(window).apply(ewma).values

EDIT 2

Since there is a custom function used, the shift operation can be done separately at the end:

for col in l:
   df[f"{col}_shifted"] = df.groupby('Person')[f"{col}_shifted"].shift()

EDIT 3

Okay, this whole thing requires a more involved statement. The statement below will add the new shifted columns to the original dataframe. I tested this statement on a simple mean (replacing ewma with np.mean) and it works.

for col in l:
   df[f"{col}_shifted"] = df.groupby('Person')[col].transform(lambda x: x.rolling(20).apply(ewma).shift())
AlexK
  • 2,855
  • 9
  • 16
  • 27
  • Thanks! I was playing around with `ewm` weightings and found the Center of Mass (`com`) arguement was best set at `2.5`. So in this solution i would need to adjust the 'alpha` arguement somehow? Outside of this, looks like a great option – SOK Apr 05 '21 at 05:28
  • and also can i add an `_shifted` to the newly created column name? – SOK Apr 05 '21 at 05:30
  • @SOK Updated the answer to add this suffix to each new column. – AlexK Apr 05 '21 at 05:35
  • Thanks, i seem to be getting a key error with `Person`. Is the line `ewm_df = df[l].groupby('Person')` removing person from the `df`? – SOK Apr 05 '21 at 05:38
  • Yep, added a line to add 'Person' to the same list. – AlexK Apr 05 '21 at 05:43
  • 1
    There is a formula for `com` in [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ewm.html) that maps `alpha` to `com`, so you should be able to derive the `alpha` value from that. – AlexK Apr 05 '21 at 05:45
  • Is there any way to add the columns to the existing df? – SOK Apr 05 '21 at 05:45
  • 1
    @SOK I added a workaround for creating new columns in the original df. If I think of a better way, I'll add it. – AlexK Apr 05 '21 at 06:09
  • Looks great - only thing is the `window` function seems to be applying to every row and not the specific group. Appreciate your help with this! – SOK Apr 05 '21 at 06:25
  • 1
    This is great, thanks so much for all of your help – SOK Apr 05 '21 at 07:09