Moving averages with multiple GroupBy

Question

This is a small rep of my data:

Team <- rep(c("ind", "sas", "ind", "sas"),c(4,8,2,4))

Player <- c("Paul George", "David West", "Roy Hibbert",
            "Paul George", "Tim Duncan", "Manuel Ginobili",
            "Tony Parker", "Boris Diaw","Danny Green", 
            "Kawhi Leonard", "Matt Bonner", "Patty Mills",
            "George Hill", "C.J.Miles","Tim Duncan",
            "Manuel Ginobili", "Tony Parker", "Boris Diaw")

Team_PTS <- c(101,101,101,98,105,105,105,105,
              105,105,105,105,98,98,89,89,89,128)

Date <- as.Date(c("2015-05-14", "2015-05-14", "2015-05-14",
               "2015-05-16","2015-05-15", "2015-05-15", "2015-05-15",
               "2015-05-15","2015-05-15", "2015-05-15", "2015-05-15",
               "2015-05-15","2015-05-16","2015-05-16","2015-05-29",
               "2015-05-29","2015-05-29","2015-06-03"))

Team_Gamenumber <- rep(c(1,2,1,2,2,3),c(3,1,8,2,3,1))

df <- data.frame(Team,Player,Team_PTS,Date, Team_Gamenumber)

df

   Team          Player Team_PTS       Date Team_Gamenumber Desired_output
1   ind     Paul George      101 2015-05-14               1            101
2   ind      David West      101 2015-05-14               1            101
3   ind     Roy Hibbert      101 2015-05-14               1            101
4   ind     Paul George       98 2015-05-16               2           99.5
5   sas      Tim Duncan      105 2015-05-15               1            105
6   sas Manuel Ginobili      105 2015-05-15               1            105
7   sas     Tony Parker      105 2015-05-15               1            105
8   sas      Boris Diaw      105 2015-05-15               1            105
9   sas     Danny Green      105 2015-05-15               1            105
10  sas   Kawhi Leonard      105 2015-05-15               1            105
11  sas     Matt Bonner      105 2015-05-15               1            105
12  sas     Patty Mills      105 2015-05-15               1            105
13  ind     George Hill       98 2015-05-16               2           99.5
14  ind       C.J.Miles       98 2015-05-16               2           99.5
15  sas      Tim Duncan       89 2015-05-29               2             97
16  sas Manuel Ginobili       89 2015-05-29               2             97
17  sas     Tony Parker       89 2015-05-29               2             97
18  sas      Boris Diaw      128 2015-06-03               3         107.33

The desired output variable is the moving or cummulative average of the Team points (sas and ind in this example).

I have tried:

library(dplyr)
df %>% group_by(Team) %>%
       mutate(cumavg_PTS = cumsum(Team_PTS) / seq_along(Team_PTS))

However that yields a wrong output since the information is organized by players. See Boris Diaw misses game 2 with sas but plays on game 3.

Also I think cumsum is not the right approach in this case because the average will be affected by the number of players that play every single match.

The 107.33 comes from the average of sas first 3 games (105 + 89 + 128)/3

it is not clear to me how do you want to group the elements. First grouping by `Team` and then? — SabDeM, Aug 12 '15 at 19:38
@SabDem you group by Team but the information is organized by players (there are multiple players playing each game) so that is what is difficulting my calculation of the moving average. See for ind once paul george plays the 2nd game then the average moves from 101 to 99.5 because ind score 101 and 98 pts in first and second game — Sburg13, Aug 12 '15 at 19:46

Arun · Answer 1 · 2015-08-13T07:07:00.753

Here's another way. I'll do it using data.table:

require(data.table)
setDT(df)[, cavg := { dups = !duplicated(Team_Gamenumber)
                      cumsum(Team_PTS * dups) / cumsum(dups)
                    }, by = Team]

Or just write a function:

foo <- function(points, game) {
    dups = !duplicated(game)
    cumsum(points * dups) / cumsum(dups)
}
setDT(df)[, cavg := foo(Team_PTS, Team_Gamenumber), by = Team]

There's still a difference between the solutions of @bgoldst and @jeremycg. @bgoldst's computes the cumulative averages on the data sorted by Team, Team_Gamenumber, where as @jeremycg's computes by preserving the original order.

For example, from your df, interchange the game number for ind = 1:

setDT(df)[c(1:4,13:14), Team_Gamenumber := c(2,2,2,1,1,1)]
setDF(df)

Then try both the versions.

We can get both answers while preserving the original order of the data as follows:

# @jeremycg's
setDT(df)[, cavg := foo(Team_PTS, Team_Gamenumber), by = Team]
# @bglodst's
setDT(df)[order(Team, Team_Gamenumber), cavg := foo(Team_PTS, Team_Gamenumber), by = Team]

bgoldst · Accepted Answer · 2015-08-12T20:09:24.037

It appears that your Team_PTS column is redundant, in that it contains the number of points scored by the entire Team in game Team_Gamenumber, however the data.frame contains one row per player per game (that that player played in). Thus every record for that Team and Team_Gamenumber has the same Team_PTS value.

You can therefore "aggregate" the original df on Team and Team_Gamenumber, taking the first element of the redundant Team_PTS vector for the group, since all values in the group are the same anyway. As part of this aggregate() call I also solved the issue that you stored the Team_PTS values as strings, which were turned into factors by the data.frame() call. The easiest way I am aware that this can be done is by coercing from factor to actual character strings and then to numeric.

The aggregated table can then by supplemented with the Desired_Output column by grouping by Team and then using the cumsum(x)/seq_along(x) formula. This result can then be merged back with df to produce the desired result.

Also note that I manually reordered output to correspond with your expected output, just so we can verify easily by eye that it matches.

df <- data.frame(Team=rep(c('ind','sas','ind','sas'),c(4,8,2,4)),Player=c('Paul George','David West','Roy Hibbert','Paul George','Tim Duncan','Manuel Ginobili','Tony Parker','Boris Diaw','Danny Green','Kawhi Leonard','Matt Bonner','Patty Mills','George Hill','C.J.Miles','Tim Duncan','Manuel Ginobili','Tony Parker','Boris Diaw'),Team_PTS=c(101,101,101,98,105,105,105,105,105,105,105,105,98,98,89,89,89,128),Date=as.Date(c('2015-05-14','2015-05-14','2015-05-14','2015-05-16','2015-05-15','2015-05-15','2015-05-15','2015-05-15','2015-05-15','2015-05-15','2015-05-15','2015-05-15','2015-05-16','2015-05-16','2015-05-29','2015-05-29','2015-05-29','2015-06-03')),Team_Gamenumber=rep(c(1,2,1,2,2,3),c(3,1,8,2,3,1)));
output <- merge(df,transform(aggregate(cbind(Team_PTS=as.double(as.character(Team_PTS)))~Team+Team_Gamenumber,df,`[`,1),Desired_Output=ave(Team_PTS,Team,FUN=function(x) cumsum(x)/seq_along(x))))[,c(names(df),'Desired_Output')];
output[c(1:4,9,10,7,8,13,14,11,12,5,6,16:18,15),];
##    Team          Player Team_PTS       Date Team_Gamenumber Desired_Output
## 1   ind     Paul George      101 2015-05-14               1       101.0000
## 2   ind      David West      101 2015-05-14               1       101.0000
## 3   ind     Roy Hibbert      101 2015-05-14               1       101.0000
## 4   ind     Paul George       98 2015-05-16               2        99.5000
## 9   sas      Tim Duncan      105 2015-05-15               1       105.0000
## 10  sas Manuel Ginobili      105 2015-05-15               1       105.0000
## 7   sas     Tony Parker      105 2015-05-15               1       105.0000
## 8   sas      Boris Diaw      105 2015-05-15               1       105.0000
## 13  sas     Danny Green      105 2015-05-15               1       105.0000
## 14  sas   Kawhi Leonard      105 2015-05-15               1       105.0000
## 11  sas     Matt Bonner      105 2015-05-15               1       105.0000
## 12  sas     Patty Mills      105 2015-05-15               1       105.0000
## 5   ind     George Hill       98 2015-05-16               2        99.5000
## 6   ind       C.J.Miles       98 2015-05-16               2        99.5000
## 16  sas      Tim Duncan       89 2015-05-29               2        97.0000
## 17  sas Manuel Ginobili       89 2015-05-29               2        97.0000
## 18  sas     Tony Parker       89 2015-05-29               2        97.0000
## 15  sas      Boris Diaw      128 2015-06-03               3       107.3333

many thanks! is it necessary to rearrange columns to calculate output? — Sburg13, Aug 12 '15 at 20:03
That's a side effect of the `merge()` call; it always places the "merge-by" columns first. If you want to preserve the original column order, you can add an index operation onto the end of the massive one-liner (thus making it more massive) as follows: `output <- ...[,c(names(df),'Desired_Output')]`. — bgoldst, Aug 12 '15 at 20:05
@Sburg13 I just edited my answer to include the column order preservation index operation. — bgoldst, Aug 12 '15 at 20:10
thanks! works well!! the issue of team_PTS stored as strings doesn't need to be solved since in my real data that is not an issue and I created the issue when posting the example — Sburg13, Aug 12 '15 at 20:17

jeremycg · Answer 3 · 2015-08-12T20:03:48.260

using dplyr, in a horrible mess:

df %>% distinct(Team, Team_Gamenumber) %>%
       group_by(Team) %>%
       mutate(cumavg_PTS = cummean(Team_PTS)) %>%
       select(Team, Team_Gamenumber, cumavg_PTS) %>%
       inner_join(df, .)

Joining by: c("Team", "Team_Gamenumber")
   Team          Player Team_PTS       Date Team_Gamenumber cumavg_PTS
1   ind     Paul George      101 2015-05-14               1   101.0000
2   ind      David West      101 2015-05-14               1   101.0000
3   ind     Roy Hibbert      101 2015-05-14               1   101.0000
4   ind     Paul George       98 2015-05-16               2    99.5000
5   sas      Tim Duncan      105 2015-05-15               1   105.0000
6   sas Manuel Ginobili      105 2015-05-15               1   105.0000
7   sas     Tony Parker      105 2015-05-15               1   105.0000
8   sas      Boris Diaw      105 2015-05-15               1   105.0000
9   sas     Danny Green      105 2015-05-15               1   105.0000
10  sas   Kawhi Leonard      105 2015-05-15               1   105.0000
11  sas     Matt Bonner      105 2015-05-15               1   105.0000
12  sas     Patty Mills      105 2015-05-15               1   105.0000
13  ind     George Hill       98 2015-05-16               2    99.5000
14  ind       C.J.Miles       98 2015-05-16               2    99.5000
15  sas      Tim Duncan       89 2015-05-29               2    97.0000
16  sas Manuel Ginobili       89 2015-05-29               2    97.0000
17  sas     Tony Parker       89 2015-05-29               2    97.0000
18  sas      Boris Diaw      128 2015-06-03               3   107.3333

Your solution works perfect with the example but it shows some issues with the real data. I have been checking your formula and I understand all you do until the inner_join part. Could you explain how that works? thank you again — Sburg13, Aug 12 '15 at 22:34
see ?inner_join - it merges the original df with the modified one based on Team and Team_Gamenumber. What is it doing wrong? You might need to add `arrange(Team, Team_Gamenumber)` if your data isn't in order — jeremycg, Aug 12 '15 at 22:42
but the example data is not in order either... will keep trying and let you know if works. thanks — Sburg13, Aug 12 '15 at 22:45
you could also replace inner_join by merge(df, .) and it will give the same answer if that helps — jeremycg, Aug 12 '15 at 22:51

Moving averages with multiple GroupBy

3 Answers3