1

I really need to speed some R code up. I have a large dataset from a particular sport. Each row in the data frame represents an action in a game. For each game (game_id) there are two teams (team_id) taking part. time_ref orders the actions chronologically within each game, and action_id is the type of action. player_off is only relevant for action_id=3, which represents a player receiving a card: it is set to TRUE/FALSE depending on whether the player was sent off when they got that card, and is NA otherwise. Example data.frame:

> df

game_id team_id action_id   player_off  time_ref
100     10         1             NA       1000
100     10         1             NA       1001
100     10         1             NA       1002
100     11         1             NA       1003
100     11         2             NA       1004
100     11         1             NA       1005
100     10         3             1        1006
100     11         1             NA       1007
100     10         1             NA       1008
100     10         1             NA       1009
101     12         3             0        1000
101     12         1             NA       1001
101     12         1             NA       1002
101     13         2             NA       1003
101     13         3             1        1004
101     12         1             NA       1005
101     13         1             NA       1006
101     13         1             NA       1007
101     12         1             NA       1008
101     12         1             NA       1009
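
For reproducibility, the example above can be rebuilt with (values copied straight from the table):

df <- data.frame(
  game_id    = rep(c(100L, 101L), each = 10),
  team_id    = c(10, 10, 10, 11, 11, 11, 10, 11, 10, 10,
                 12, 12, 12, 13, 13, 12, 13, 13, 12, 12),
  action_id  = c(1, 1, 1, 1, 2, 1, 3, 1, 1, 1,
                 3, 1, 1, 2, 3, 1, 1, 1, 1, 1),
  player_off = c(NA, NA, NA, NA, NA, NA, 1, NA, NA, NA,
                 0, NA, NA, NA, 1, NA, NA, NA, NA, NA),
  time_ref   = rep(1000:1009, times = 2)
)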

What I need is another column in the data frame that tells me, for each action (row), whether the two teams had an equal or unequal number of players on the field while that action took place.

So game_id=100 had an action_id=3 with player_off=1 for team_id=10 at time_ref=1006. We know the teams had an equal number of players on the field up to that point, but were unequal for the rest of the game (time_ref>1006). The same thing occurred in game_id=101.

This is an example of the data frame with the extra column I would like to have for the dataset.

>df
game_id team_id action_id   player_off  time_ref    is_even
100      10        1            NA        1000         1
100      10        1            NA        1001         1
100      10        1            NA        1002         1 
100      11        1            NA        1003         1
100      11        2            NA        1004         1
100      11        1            NA        1005         1
100      10        3            1         1006         1
100      11        1            NA        1007         0
100      10        1            NA        1008         0
100      10        1            NA        1009         0
101      12        3            0         1000         1
101      12        1            NA        1001         1
101      12        1            NA        1002         1
101      13        2            NA        1003         1
101      13        3            1         1004         1
101      12        1            NA        1005         0
101      13        1            NA        1006         0
101      13        1            NA        1007         0
101      12        1            NA        1008         0
101      12        1            NA        1009         0

So you can see that in game_id=100 a player was sent off at time_ref=1006, so all previous rows are marked is_even=1 and all subsequent rows are marked as uneven, or 0. Similarly for game_id=101 at time_ref=1004.

What is the most efficient way of achieving this extra column? Preferably not using for loops.

SteMc
  • The "most efficient" way is probably to write C code for your special case; is this really what you are asking for? – Martin Morgan Apr 02 '18 at 12:02
  • I know what you mean but I'd prefer to keep it in R actually. – SteMc Apr 02 '18 at 12:34
  • The comment still applies -- you're not looking for 'the most efficient', but something that's reasonably efficient for the task at hand. And the task isn't how to add a new column to a data frame, but how to transform a column by groups. I guess I'm trying to tell you to revise your title (this might help you find existing questions and answers). – Martin Morgan Apr 02 '18 at 12:37
  • True. I've removed "most" from the title. – SteMc Apr 02 '18 at 13:04
  • With data.table, your example is handled by `mDT = DT[player_off == 1, .(game_id, time_ref)]; DT[, is_even := 1L][mDT, on=.(game_id, time_ref > time_ref), is_even := 0L]` but I guess your example is not general enough (eg, doesn't show what happens after both teams have a player off and are even again). – Frank Apr 02 '18 at 15:33

3 Answers

5

For some vector

x = c(0, NA, NA, NA, 1, NA, NA, NA)

write a function to standardize the data (0 or 1 player lost), calculate the cumulative number of players lost, and compare this to zero,

fun0 = function(x)  {
    x[is.na(x)] = 0
    cumsum(x) == 0
}

For several groups, use ave() with a grouping variable

x = c(x, rev(x))
grp = rep(1:2, each = length(x) / 2)
ave(x, grp, FUN = fun0)
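
For this example the result should be (the logical values are coerced back to the numeric 0/1 of x when they are put back into the vector):

ave(x, grp, FUN = fun0)
## [1] 1 1 1 1 0 0 0 0 1 1 1 0 0 0 0 0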

For the data in the question, try

df$is_even = ave(df$player_off, df$game_id, FUN = fun0)

Semantically, it seems likely that the problem is more complicated than fun0() implies: specifically, if each team loses a player, they are again even, as @SunLisa says. If so, clean the data

df$player_off[is.na(df$player_off)] = 0

and change fun0(), e.g.,

fun1 <- function(x, team) {
    is_team_1 <- team == head(team, 1) # is 'team' the first team?
    x1 <- x & is_team_1                # lost player & team 1
    x2 <- x & !is_team_1               # lost player & team 2
    cumsum(x1) == cumsum(x2)           # same total number of players?
}
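
As a quick check on a made-up mini-example (not the question's data), fun1() reports the teams as even again once each side has lost a player:

off  <- c(0, 0, 1, 0, 1, 0)        # cleaned player_off values (NAs already set to 0)
team <- c(10, 11, 10, 11, 11, 10)  # team_id for the same rows
fun1(off, team)
## [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE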

(it doesn't seem like a good idea to coerce the logical return value to an integer). This could be applied by group with

df$is_even = ave(seq_len(nrow(df)), df$game_id, FUN = function(i) {
    fun1(df$player_off[i], df$team_id[i])
})

or

split(df$is_even, df$game_id) <-
    Map(fun1,
        split(df$player_off, df$game_id),
        split(df$team_id, df$game_id)
    )

The implementation of ave() is useful to look at, the important line being

split(x, g) <- lapply(split(x, g), FUN)

The right-hand side splits x by group g, then applies FUN() to each group. The left-hand side split<-() is a tricky operation, using the group indexes to update the original vector x.
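
A minimal illustration of that idiom, with made-up values:

x <- c(1, 2, 3, 4, 5, 6)
g <- c("a", "b", "a", "b", "a", "b")
split(x, g) <- lapply(split(x, g), cumsum)  # cumulative sum within each group
x
## [1]  1  2  4  6  9 12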

Comments

The original question asked for 'no for loops', but lapply() (inside ave()) and Map() are exactly that; ave() is relatively efficient because of the split-apply-combine strategy it adopts, rather than what the OP probably implemented, which was likely to iterate through games, subset the data frame, and then update the data.frame for each game. The subsetting would have duplicated parts of the entire data set, and the update in particular would have copied at least the entire result column on each assignment; this copying would have slowed execution down a lot. It's also possible that the OP was struggling with something like fun0(); it would help to clarify the question, especially the title, to identify that as the problem.

There are faster ways, especially using the data.table package, but the principle is the same -- identify a function that operates on a vector the way you'd like, and apply it by group.
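
For example, a data.table version of the same idea might look like this (a sketch, assuming the data.table package and the cleaned player_off and fun1() from above):

library(data.table)
## assumes player_off NAs have already been replaced with 0, as above
setDT(df)
df[, is_even := fun1(player_off, team_id), by = game_id]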

An alternative, fully vectorized solution follows this suggestion to calculate a cumulative sum by group. For fun0(), standardize x to be the number of players leaving the game at a particular time point, without NAs

x[is.na(x)] = 0

For the equivalent of fun0(), calculate the cumulative sum of players leaving the game, irrespective of group

cs = cumsum(x)

Correct this for the group that the cumulative sum applies to

in_game = cs - (grp - 1)

and set this to 'TRUE' when 0 players have left the game

is_even = (in_game == 0)

This relies on grp indexing from 1 to the number of groups; for the data here one might use grp = match(df$game_id, unique(df$game_id)). A similar solution exists for fun1().
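
Putting those steps together as a sketch for the fun0() version (the grp - 1 correction works for this example because each earlier game contributes exactly one sent-off player):

x <- df$player_off
x[is.na(x)] <- 0                               # players leaving at each time point
grp <- match(df$game_id, unique(df$game_id))   # group index, 1..number of games
cs <- cumsum(x)                                # cumulative players lost across all games
in_game <- cs - (grp - 1)                      # correct for what earlier games contributed
df$is_even <- as.integer(in_game == 0)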

Martin Morgan
  • Tried running your code and the `is_even` column seems to overshoot by 1 cell. For example, in `game_id` = 100, the answer should be 7 1s and the rest 0s, but in yours, it's 8 1s and the rest 0s. – Anonymous Apr 02 '18 at 12:37
  • This is a nice start but it doesn't return the correct vector. It gives the last 5 data points in is_even as 1,0,0,1,1 when they should all be false as a player was sent off at time_ref=1004. – SteMc Apr 02 '18 at 12:41
  • So for `ave()` the team_id doesn't matter; don't include it as a grouping variable. – Martin Morgan Apr 02 '18 at 12:46
  • Nice, it works if you exclude `df$team_id` from `ave`. Can you explain what your code does, in particular the `fun` function that you made and the base `ave` function? – Anonymous Apr 02 '18 at 13:26
  • @Anonymous I updated the answer to walk through the code in more detail. – Martin Morgan Apr 02 '18 at 13:38
2

Here's a dplyr + tidyr solution to the problem, with a summary of what was done:

  1. Prepare the data by converting all NAs in player_off to 0 for easier summing, and label the team with the smaller team_id as team1 and the other as team2 (assuming there are only two teams per game)
  2. "Tally" the player_offs using spread and fill the invalid combinations in the data with 0 -- for example, in game_id = 100, there's no team_id = 11 for time_ref = 1000
  3. Take the cumulative sum of the lagged team1 and team2 columns (using a lag default of 0 so the first row is not NA), and set is_even to whether the two cumulative sums are equal

Code below:

require(dplyr)
require(tidyr)

df %>%
  group_by(game_id) %>%
  mutate(
    player_off = player_off %>% replace(list = is.na(.), values = 0),
    team_num = if_else(team_id == min(team_id), "team1", "team2")
  ) %>%
  spread(key = team_num, value = player_off, fill = 0) %>%
  arrange(game_id, time_ref) %>%
  mutate(
    team1_cum = cumsum(lag(team1, default = 0)),
    team2_cum = cumsum(lag(team2, default = 0)),
    is_even = as.integer(team1_cum == team2_cum)
  ) %>%
  ungroup() %>%
  select(-team1, -team2, -team1_cum, -team2_cum)

Output:

# A tibble: 20 x 5
   game_id team_id action_id time_ref is_even
     <int>   <int>     <int>    <int>   <int>
 1     100      10         1     1000       1
 2     100      10         1     1001       1
 3     100      10         1     1002       1
 4     100      11         1     1003       1
 5     100      11         2     1004       1
 6     100      11         1     1005       1
 7     100      10         3     1006       1
 8     100      11         1     1007       0
 9     100      10         1     1008       0
10     100      10         1     1009       0
11     101      12         3     1000       1
12     101      12         1     1001       1
13     101      12         1     1002       1
14     101      13         2     1003       1
15     101      13         3     1004       1
16     101      12         1     1005       0
17     101      13         1     1006       0
18     101      13         1     1007       0
19     101      12         1     1008       0
20     101      12         1     1009       0
Anonymous
2

Here's my thinking:

data.table works well here, especially with large data sets; it's fast. We just need to group by game, take the cumulative sum of each team's sent-off players, and check whether the two are equal.

First I have to say:

(This issue has since been addressed by Martin Morgan; his updated answer no longer has this error.)

I don't think @Martin Morgan's original answer was right. Let's imagine a certain case:

when team 1 has had one player sent off, and team 2 later has another player sent off, the two teams should be even again, but that output would be FALSE.

I'll make an example with this dataset, where player_off in record 19 has been modified to 1. This means that in game 101, after team 13 had a player off at 1004, team 12 had a player off at 1008, which would make the two teams even again at 1009.

> dt.1
   game_id team_id action_id player_off time_ref
1      100      10         1         NA     1000
2      100      10         1         NA     1001
3      100      10         1         NA     1002
4      100      11         1         NA     1003
5      100      11         2         NA     1004
6      100      11         1         NA     1005
7      100      10         3          1     1006
8      100      11         1         NA     1007
9      100      10         1         NA     1008
10     100      10         1         NA     1009
11     101      12         3          0     1000
12     101      12         1         NA     1001
13     101      12         1         NA     1002
14     101      13         2         NA     1003
15     101      13         3          1     1004
16     101      12         1         NA     1005
17     101      13         1         NA     1006
18     101      13         1         NA     1007
19     101      12         1          1     1008
20     101      12         1         NA     1009

But @Martin Morgan's original function would produce this output:

> dt.1$is_even = ave(df$player_off, df$game_id, FUN = fun)
> dt.1
   game_id team_id action_id player_off time_ref is_even
1      100      10         1         NA     1000       1
2      100      10         1         NA     1001       1
3      100      10         1         NA     1002       1
4      100      11         1         NA     1003       1
5      100      11         2         NA     1004       1
6      100      11         1         NA     1005       1
7      100      10         3          1     1006       1
8      100      11         1         NA     1007       0
9      100      10         1         NA     1008       0
10     100      10         1         NA     1009       0
11     101      12         3          0     1000       1
12     101      12         1         NA     1001       1
13     101      12         1         NA     1002       1
14     101      13         2         NA     1003       1
15     101      13         3          1     1004       1
16     101      12         1         NA     1005       0
17     101      13         1         NA     1006       0
18     101      13         1         NA     1007       0
19     101      12         1          1     1008       0
20     101      12         1         NA     1009       0

Notice how in rows 19 and 20, is_even=0, which is not what the OP wants.

My code does not process NAs, so I am going to transform NA to 0 first.

> library(data.table)
> dt.1 <- as.data.table(dt.1)
> dt.1[is.na(dt.1)] <- 0

My code produces the correct output: at times 1008 and 1009, when both team 12 and team 13 have had 1 player off, the two teams are even.

> dt.1[,.(action_id,team2_off=(team_id==max(team_id))*player_off,team1_off=(team_id==min(team_id))*player_off,team_id,time_ref,player_off),by=game_id][order(game_id,time_ref)][,.(team_id,time_ref,action_id,player_off,even=as.numeric(cumsum(team2_off)==cumsum(team1_off))),by=game_id]
    game_id team_id time_ref action_id player_off even
 1:     100      10     1000         1          0    1
 2:     100      10     1001         1          0    1
 3:     100      10     1002         1          0    1
 4:     100      11     1003         1          0    1
 5:     100      11     1004         2          0    1
 6:     100      11     1005         1          0    1
 7:     100      10     1006         3          1    0
 8:     100      11     1007         1          0    0
 9:     100      10     1008         1          0    0
10:     100      10     1009         1          0    0
11:     101      12     1000         3          0    1
12:     101      12     1001         1          0    1
13:     101      12     1002         1          0    1
14:     101      13     1003         2          0    1
15:     101      13     1004         3          1    0
16:     101      12     1005         1          0    0
17:     101      13     1006         1          0    0
18:     101      13     1007         1          0    0
19:     101      12     1008         1          1    1
20:     101      12     1009         1          0    1

I understand it is a messy-looking chunk of data.table code; let me explain it step by step.

dt[, .(
  action_id,
  team2_off = (team_id == max(team_id)) * player_off,
  team1_off = (team_id == min(team_id)) * player_off,
  team_id,
  time_ref,
  player_off
), by = game_id][order(game_id, time_ref)][, .(team_id,
                                               time_ref,
                                               action_id,
                                               player_off,
                                               even = cumsum(team2_off) == cumsum(team1_off)), by = game_id]

First, we take the data.table dt, group it by game_id, and do this calculation:

  team2_off = (team_id == max(team_id)) * player_off,
  team1_off = (team_id == min(team_id)) * player_off

data.table has some trouble with two levels of grouping at once (by game_id and by team_id), but it handles logical expressions inside each group well. This way we effectively get team1_off and team2_off, by multiplying the logical result of team_id == max(team_id) (or min(team_id)) with player_off. The product is 1 only when both are 1, which means a player was sent off in the selected team on that row.
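
A tiny illustration of the trick, on made-up vectors for a single game:

team_id    <- c(10, 11, 10, 11)
player_off <- c(0, 0, 1, 0)
(team_id == min(team_id)) * player_off   # team1_off: 0 0 1 0
(team_id == max(team_id)) * player_off   # team2_off: 0 0 0 0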

Now we have a data table of:

> dt.1[,.(action_id,team2_off=(team_id==max(team_id))*player_off,team1_off=(team_id==min(team_id))*player_off,team_id,time_ref,player_off),by=game_id]
    game_id action_id team2_off team1_off team_id time_ref player_off
 1:     100         1         0         0      10     1000          0
 2:     100         1         0         0      10     1001          0
 3:     100         1         0         0      10     1002          0
 4:     100         1         0         0      11     1003          0
 5:     100         2         0         0      11     1004          0
 6:     100         1         0         0      11     1005          0
 7:     100         3         0         1      10     1006          1
 8:     100         1         0         0      11     1007          0
 9:     100         1         0         0      10     1008          0
10:     100         1         0         0      10     1009          0
11:     101         3         0         0      12     1000          0
12:     101         1         0         0      12     1001          0
13:     101         1         0         0      12     1002          0
14:     101         2         0         0      13     1003          0
15:     101         3         1         0      13     1004          1
16:     101         1         0         0      12     1005          0
17:     101         1         0         0      13     1006          0
18:     101         1         0         0      13     1007          0
19:     101         1         0         1      12     1008          1
20:     101         1         0         0      12     1009          0

Now we no longer need to group by two variables (team_id and game_id); we can just take the cumulative sums by game_id and check whether cumsum(team1_off) == cumsum(team2_off). We also order by game_id and time_ref so the result comes out in the correct order.

I understand that NA may mean something different from 0 in this scenario. If you really care about that, just create a dummy copy of player_off and work on that instead:

> dt$dummy<-dt$player_off
> dt$dummy[is.na(dt$dummy)]<-0
> dt<-as.data.table(dt)
> dt[, .(
+   action_id,
+   team2_off = (team_id == max(team_id)) * dummy,
+   team1_off = (team_id == min(team_id)) * dummy,
+   team_id,
+   time_ref,
+   player_off
+ ), by = game_id][order(game_id, time_ref)][, .(team_id,
+                                                time_ref,
+                                                action_id,
+                                                player_off,
+                                                even = as.numeric(cumsum(team2_off) == cumsum(team1_off))), by = game_id]
    game_id team_id time_ref action_id player_off even
 1:     100      10     1000         1         NA    1
 2:     100      10     1001         1         NA    1
 3:     100      10     1002         1         NA    1
 4:     100      11     1003         1         NA    1
 5:     100      11     1004         2         NA    1
 6:     100      11     1005         1         NA    1
 7:     100      10     1006         3          1    0
 8:     100      11     1007         1         NA    0
 9:     100      10     1008         1         NA    0
10:     100      10     1009         1         NA    0
11:     101      12     1000         3          0    1
12:     101      12     1001         1         NA    1
13:     101      12     1002         1         NA    1
14:     101      13     1003         2         NA    1
15:     101      13     1004         3          1    0
16:     101      12     1005         1         NA    0
17:     101      13     1006         1         NA    0
18:     101      13     1007         1         NA    0
19:     101      12     1008         1         NA    0
20:     101      12     1009         1         NA    0

I really think your question is very interesting, and I was determined to solve it using data.table. It took me a few hours and I almost gave up on data.table, thinking that it just can't process two groupings at a time. I eventually solved it with a logical multiplication.

Great fun I had

  team1_off = (team_id == min(team_id)) * dummy
  team2_off = (team_id == max(team_id)) * dummy
SunLisa
  • I agree that my answer doesn't handle both teams losing players; the original question isn't clear, though semantically `is_even` implies your interpretation. I updated my response to reflect the different interpretation. – Martin Morgan Apr 02 '18 at 15:00
  • @Martin Morgan, I'm really impressed by your function solution. As you said in your post, the ultimate idea is to get a function which can produce the right result. Although this question is kinda tricky, your solution is highly inspiring! Thanks for your inspiration. It took me some time to understand your function. – SunLisa Apr 02 '18 at 15:12