Calculating the mean using logical condition

Question

I have a football dataset for a season and some variable are: player_id, week and points (a grade for each player in a match).

So, each player_id appears several times in my dataset.

My goal is to calculate the average points for each player, but just to previous weeks.

For example, to the row where player_id=5445 and week=10, I want the mean when data has player_id=5445 and week is from 1 to 9.

I know I can do it filtering data for each row and calculating it. But I hope to do it in a smarter/faster way...

I thought something like:

aggregate(mydata$points, FUN=mean, 
          by=list(player_id=mydata$player_id, week<mydata$week))

but it did not work

Thankss!!!

For `player_id=mydata$player_id` to work you need a double equal sign like `player_id==mydata$player_id` not sure that's the only issue though. — Mako212, Nov 02 '17 at 20:50
I would be helpful if you made your question reproducable by including sample data. Also, what exactly should your output look like - one row per combination of player_id and week? — jruf003, Nov 02 '17 at 20:55
to make it reproducible player_id<-c(3242,56546,76575,4234,654654,6564,43242,42344,4342,6776,5432,8796,54767) week<-1:30 points<-rnorm(390) mydata<- data.frame(player_id=rep(player_id,30),week=rep(week,13),points) I would like to create an extra column called Previous_mean — Luis Otavio Fernandes, Nov 02 '17 at 21:07

ssp3nc3r · Accepted Answer · 2017-11-02T23:20:40.523

1

Here's a solution along with some sample data,

football_df <- 
  data.frame(player_id = c(1, 2, 3, 4),
             points = as.integer(runif(40, 0, 10)), 
             week = rep(1:10, each = 4))

Getting a running average:

require(dplyr)
football_df %>% 
      group_by(player_id) %>%    # the group to perform the stat on
      arrange(week) %>%          # order the weeks within each group
      mutate(avg = cummean(points) ) %>% # for each week get the cumulative mean
      mutate(avg = lag(avg) ) %>% # shift cumulative mean back one week
      arrange(player_id) # sort by player_id

Here's the first two players of the resulting table, for which you can see that for player 1 in week 2, the previous week's average is 7, and in week 3, the previous week's average is (9+7) / 2 = 8 ... :

   player_id points week      avg
1          1      7    1       NA
2          1      9    2 7.000000
3          1      9    3 8.000000
4          1      1    4 8.333333
5          1      4    5 6.500000
6          1      8    6 6.000000
7          1      0    7 6.333333
8          1      2    8 5.428571
9          1      5    9 5.000000
10         1      8   10 5.000000
11         2      6    1       NA
12         2      9    2 6.000000
13         2      5    3 7.500000
14         2      1    4 6.666667
15         2      0    5 5.250000
16         2      9    6 4.200000
17         2      8    7 5.000000
18         2      6    8 5.428571
19         2      6    9 5.500000
20         2      8   10 5.555556

edited Nov 02 '17 at 23:20

answered Nov 02 '17 at 21:13

ssp3nc3r

3,662
2
13
23

Thanks. But I think there is something missing. It is calculating the average by week, not by player. And another thing... It would not be possible to have an previous average to any player when week equals 1. – Luis Otavio Fernandes Nov 02 '17 at 21:53
No, it's calculating the average from the first week to the prior week for each player. You are right that there isn't an average for week one, so those values are `NA`. Maybe it's easier to review if you finish by `arrange(player_id)` – ssp3nc3r Nov 02 '17 at 21:55
Note the difference between `mean` and `cummean` – ssp3nc3r Nov 02 '17 at 22:04
It is not working to me. I need to calculate the previous mean by player. And points of players 2, 3 or 4 should not influenciate in player 1 mean. – Luis Otavio Fernandes Nov 02 '17 at 23:11
I've added the tabular result showing the code works. – ssp3nc3r Nov 02 '17 at 23:21
Sorry! My calculations were really different... That´s why I sent these messages... But when I saw your result I saw that the code was doing exactly what I wanted and was not a comunication issue. So, I updated my dplyr package and it worked like it shoulded be. thank you very much – Luis Otavio Fernandes Nov 02 '17 at 23:38

score 1 · Answer 2 · answered Nov 02 '17 at 21:30

I will use your data but with a call to set.seed to make the results reproducible. Then I will call aggregate with the formula interface. Note that I've changed the name of the variable week to last_week to be used in subset.

set.seed(2550)    # make the results reproducible

player_id <- c(3242,56546,76575,4234,654654,6564,43242,42344,4342,6776,5432,8796,54767)
week <- 1:30
points <- rnorm(390)
mydata <- data.frame(player_id = rep(player_id, 30), 
                     week = rep(week,13),points)

last_week <- 10
agg <- aggregate(points ~ player_id + week, data = subset(mydata, week < last_week), mean)
head(agg)
#  player_id week     points
#1      3242    1 -1.3281831
#2      4234    1  0.3578657
#3      4342    1 -0.8267423
#4      5432    1 -0.4245487
#5      6564    1 -0.2968879
#6      6776    1  0.8348178

Calculating the mean using logical condition

2 Answers2

Linked