0

I am trying to create a new column of the moving average (window=3) called 'MA3_WHIP' of a column 'cum_year_WHIP' in a data frame. I tried the following code to achieve it:

read_and_optimized['MA3_WHIP'] = read_and_optimized['cum_year_WHIP'].rolling(3).mean()

But that for some reason does not give me the rolling average that I'd like.

Before creating the 'cum_year_WHIP' column, I sorted the df by 'YEAR_ID' and 'Game_date':

read_and_optimized.sort_values(['YEAR_ID','Game_Date'], ascending=True,inplace=True)

Then I created the 'cum_year_WHIP' column which is the column that the new rolling average column 'MA3_WHIP' is based on and was calculated by using cumsum() on math between three other columns ((cum_walks_a + cum_hits_a)/cum_innings_pitched):

read_and_optimized['cum_year_WHIP'] =(read_and_optimized['cum_year_walks_a'] + read_and_optimized['cum_year_hits_a'])/ read_and_optimized['cum_year_innings_pitched']

In particular, I'd like the 'MA3_WHIP' to be sorted by the 'YEAR_ID' and 'Game_Date' columns just as the 'cum_year_WHIP' column is, as well as grouped by the 'resp_starting_pitcher' and 'YEAR_ID' columns.

To print out what the table looks like, I use this code:

df=read_and_optimized[['YEAR_ID','Game_Date','resp_starting_pitcher','cum_year_WHIP','MA3_WHIP']].sort_values(['YEAR_ID','Game_Date'], ascending=True).groupby(['resp_starting_pitcher','YEAR_ID']).apply(print)

and from that it gives me this undesired output:

    YEAR_ID  Game_Date resp_starting_pitcher  cum_year_WHIP  MA3_WHIP
30677     2012 2012-08-25              abadf001       2.000000  1.438035
19247     2012 2012-08-31              abadf001       2.280009  1.547771
35725     2012 2012-09-05              abadf001       2.270277  1.622140
19257     2012 2012-09-12              abadf001       2.234052  1.736054
42448     2012 2012-09-18              abadf001       1.983877  1.646596
19273     2012 2012-09-24              abadf001       1.880600  1.444433
       YEAR_ID  Game_Date resp_starting_pitcher  cum_year_WHIP  MA3_WHIP
6930      2011 2011-05-21              aceva001       1.000000  1.257886
17000     2011 2011-05-26              aceva001       1.090909  1.228938
6936      2011 2011-05-31              aceva001       1.437500  1.554379
6954      2011 2011-06-21              aceva001       1.571429  1.710058

What I'd like to get instead is a rolling average of 'cum_year_WHIP' that starts over with each new 'resp_starting pitcher' and at the start of each new 'YEAR_ID'. It should look like this:

   YEAR_ID  Game_Date resp_starting_pitcher  cum_year_WHIP  MA3_WHIP
30677     2012 2012-08-25              abadf001       2.000000  Nan
19247     2012 2012-08-31              abadf001       2.280009  Nan
35725     2012 2012-09-05              abadf001       2.270277  2.183428
19257     2012 2012-09-12              abadf001       2.234052  2.261446
42448     2012 2012-09-18              abadf001       1.983877  2.162735
19273     2012 2012-09-24              abadf001       1.880600  2.032843

       YEAR_ID  Game_Date resp_starting_pitcher  cum_year_WHIP  MA3_WHIP
6930      2011 2011-05-21              aceva001       1.000000  Nan
17000     2011 2011-05-26              aceva001       1.090909  Nan
6936      2011 2011-05-31              aceva001       1.437500  1.171613
6954      2011 2011-06-21              aceva001       1.571429  1.366612

       YEAR_ID  Game_Date resp_starting_pitcher  cum_year_WHIP  MA3_WHIP
7210      2013 2013-04-11              aceva001       1.800000  Nan
13938     2013 2013-04-17              aceva001       1.900000  Nan
7226      2013 2013-04-23              aceva001       2.250006  1.983333
7260      2013 2013-05-27              aceva001       2.068969  2.072991
44210     2013 2013-06-12              aceva001       1.894739  2.071238
7276      2013 2013-06-18              aceva001       1.780222  1.914643

When I use the following it works to produce a view of the way the table should look: read_and_optimized.groupby(['resp_starting_pitcher','YEAR_ID'])['cum_year_WHIP'].rolling(3).mean() However, when I try to make a new column from the above code as suggested in other posts of similar problems, it gives me an error:

read_and_optimized['MA3_WHIP']= read_and_optimized.groupby(['resp_starting_pitcher','YEAR_ID'])['cum_year_WHIP'].rolling(window=3).mean()

The error is:

TypeError: incompatible index of inserted column with frame index

Is there a way I can accomplish creating this new column in the data frame?

I have viewed answers to similar dilemmas at: Why is groupby and rolling not working together?

and at: Pandas - moving averages - use values of previous X entries for current row

but am not able to get it done.

Help getting this done would be much appreciated.

LeeZee
  • 107
  • 1
  • 12
  • You tried `read_and_optimized.groupby(['resp_starting_pitcher','YEAR_ID'])['cum_year_WHIP'].rolling(3).mean()` ? – Henry Ecker Jul 06 '21 at 18:24
  • Hi Henry, thank you, but when I try to create a new column called 'MA3_WHIP' using that code, I get an error. read_and_optimized['MA3_WHIP']= read_and_optimized.groupby(['resp_starting_pitcher','YEAR_ID'])['cum_year_WHIP'].rolling(window=3).mean().reset_index(0,drop=True). The error is TypeError: incompatible index of inserted column with frame index – LeeZee Jul 06 '21 at 19:09

1 Answers1

0

Ok, I finally found a post that applied to my situation to assist. As pointed out in the answer to the question at: https://stackoverflow.com/questions/52801540/pandas-groupby-then-rolling-meanI had to

what was necessary was to reset the indexes of the groupby columns, in this case the columns 'resp_starting_pitcher' and 'YEAR_ID', and then drop them within the code to create the new rolling average column:

read_and_optimized['MA3_WHIP']=read_and_optimized.groupby(['resp_starting_pitcher','YEAR_ID'])['cum_year_WHIP'].rolling(3).mean().reset_index(level = ('resp_starting_pitcher','YEAR_ID'), drop = True)
LeeZee
  • 107
  • 1
  • 12