I have a data frame that contains multiple subjects (id), each with repeated observations recorded at times (time). Each of the times may or may not be associated with an event (event). An example data frame can be generated with:

set.seed(12345)
id <- c(rep(1, 9), rep(2, 9), rep(3, 9))
time <- c(seq(from = 0, to = 96, by = 12),
      seq(from = 0, to = 80, by = 10),
      seq(from = 0, to = 112, by = 14))
random <- runif(n = 27)
event <- rep(100, 27)

df <- data.frame(cbind(id, time, event, random))
df$event <- ifelse(df$random < 0.55, 0, df$event)
df <- subset(df, select = -c(random))
df$event <- ifelse(df$time == 0, 100, df$event)

I would like to calculate the time between events (tae [time after the last event]), such that the ideal output would look like:

head(ideal_df)
  id time event tae
1  1    0   100   0
2  1   12   100   0
3  1   24   100   0
4  1   36   100   0
5  1   48     0  12
6  1   60     0  24

In Fortran, I use the following code to create the tae variable:

IF(EVENT.GT.0) THEN
  TEVENT = TIME
  TAE = 0
ENDIF

IF(EVENT.EQ.0) THEN
  TAE = TIME - TEVENT
ENDIF

In R, I have attempted both an ifelse and a dplyr solution. However, neither produces my desired output.

# Calculate the time since last event (using ifelse)
df$tae <- ifelse(df$event >= 0, df$tevent = df$time & df$tae = 0, df$tae = df$time - df$tevent)

Error: unexpected '=' in "df$tae <- ifelse(df$event >= 0, df$tevent ="

# Calculate the time since last event (using dplyr)
res <- df %>%
  arrange(id, time) %>%
  group_by(id) %>%
  mutate(tae = time - lag(time))
res 

   id time event tae
1   1    0   100  NA
2   1   12   100  12
3   1   24   100  12
4   1   36   100  12
5   1   48     0  12
6   1   60     0  12

Clearly, neither of these yields my desired output. It appears that assignment inside the ifelse() function is not valid R syntax, and my dplyr attempt fails to account for the event variable.
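For what it's worth, ifelse() only returns a vector of values; it cannot perform assignments or carry the running "time of last event" across rows (the role TEVENT plays in the Fortran code). A minimal sketch of the closest it gets here:

# ifelse() returns values; it cannot update TEVENT-style state row by row
df$tevent <- ifelse(df$event > 0, df$time, NA)  # event times, NA otherwise

The NA rows would still need to be filled forward with the previous event time by some other means.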

Lastly, another variable recording the time until the next event (tue) will also be needed. If anyone has a thought on how best to approach this (perhaps trickier) calculation, please feel free to share.

Any thoughts regarding how to get one of these working (or an alternative solution) would be greatly appreciated. Thanks!

P.S. -- A reproducible example when the interval between events changes within an ID is presented below:

id <- rep(1, 9)
time <- c(0, 10, 22, 33, 45, 57, 66, 79, 92)
event <- c(100, 0, 0, 100, 0, 100, 0, 0, 100)
df <- data.frame(cbind(id, time, event))

head(df)
  id time event
1  1    0   100
2  1   10     0
3  1   22     0
4  1   33   100
5  1   45     0
6  1   57   100
Entropy

4 Answers


Here's an approach with dplyr:

library(dplyr)
df %>%
  # tmpG increments whenever `event` changes, creating one group per event row
  # and one group per run of consecutive non-event rows
  mutate(tmpG = cumsum(c(FALSE, as.logical(diff(event))))) %>%
  group_by(id) %>%
  # time gaps to the previous / next observation, zeroed out on event rows
  mutate(tmp_a = c(0, diff(time)) * !event,
         tmp_b = c(diff(time), 0) * !event) %>%
  group_by(tmpG) %>%
  # accumulate the gaps forwards (tae) and backwards (tbe) within each run
  mutate(tae = cumsum(tmp_a),
         tbe = rev(cumsum(rev(tmp_b)))) %>%
  ungroup() %>%
  select(-c(tmp_a, tmp_b, tmpG))

The new columns include time after event (tae) and time before event (tbe).
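A minimal illustration (not part of the original answer), using the event vector from the second example: tmpG increments whenever event changes, so each event row gets its own group and each run of consecutive non-event rows forms its own group; accumulating the time gaps within such a run gives the time since the preceding event.

event <- c(100, 0, 0, 100, 0, 100, 0, 0, 100)
cumsum(c(FALSE, as.logical(diff(event))))
[1] 0 1 1 2 3 4 5 5 6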

The result:

   id time event tae tbe
1   1    0   100   0   0
2   1   12   100   0   0
3   1   24   100   0   0
4   1   36   100   0   0
5   1   48     0  12  48
6   1   60     0  24  36
7   1   72     0  36  24
8   1   84     0  48  12
9   1   96   100   0   0
10  2    0   100   0   0
11  2   12     0  12  24
12  2   24     0  24  12
13  2   36   100   0   0
14  2   48     0  12  48
15  2   60     0  24  36
16  2   72     0  36  24
17  2   84     0  48  12
18  2   96     0  60   0
19  3    0   100   0   0
20  3   12   100   0   0
21  3   24     0  12  24
22  3   36     0  24  12
23  3   48   100   0   0
24  3   60   100   0   0
25  3   72   100   0   0
26  3   84     0  12  12
27  3   96   100   0   0

The result with the second example:

  id time event tae tbe
1  1    0   100   0   0
2  1   10     0  10  23
3  1   22     0  22  11
4  1   33   100   0   0
5  1   45     0  12  12
6  1   57   100   0   0
7  1   66     0   9  26
8  1   79     0  22  13
9  1   92   100   0   0
Sven Hohenstein
  • Excellent! Do you think it is possible to adapt this code to also calculate the time to the next event? – Entropy Oct 24 '14 at 19:14
  • +1 for this very nice solution. If you want to remove the temporary variable `tmp2` as well, you should insert `ungroup()` before the `select(-tmp, -tmp2)` – Jaap Oct 24 '14 at 19:27
  • Thanks -- I apologize for not realizing this earlier, but if the interval changes within IDs, the `rev(cumsum(...))` function fails to yield the correct result. Please see my edit in the post above for a reproducible example. – Entropy Oct 24 '14 at 21:11
  • In both of the examples here, at time = 0, there is an event. When I run this code, `tae` gives strange results for rows before the first event occurs. Similarly, in your examples the last row has an event; `tbe` doesn't seem to work when the last row has no event. – BLT Oct 04 '16 at 16:23

You were very close with your dplyr implementation. Try this:

df %>%
  arrange(id, time) %>%
  group_by(id) %>%
  mutate(tae = cumsum(event==0)*12)
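
If the observations are not spaced exactly 12 units apart, a possible variant (a sketch, not part of the original answer) is to group rows by the most recent event and measure time from the first row of each such group; this assumes every id starts with an event row, as in both examples:

df %>%
  arrange(id, time) %>%
  group_by(id) %>%
  mutate(grp = cumsum(event > 0)) %>%   # index of the most recent event
  group_by(id, grp) %>%
  mutate(tae = time - first(time)) %>%  # time elapsed since that event
  ungroup() %>%
  select(-grp)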
keegan
  • Thanks, nice answer. The only trouble is that my events are not consistently distributed XX hrs/dys/wks apart. I apologize for not making this clear in the original question and have revised the example code above to make this clear. – Entropy Oct 24 '14 at 19:09

You might be impressed by the compactness of dplyr, but all of the unnecessary intermediate calculations really hurt its run-time performance:

library(dplyr)
library(microbenchmark)

loopfun <- function(df){

  event <- (df$event == 100)
  lasttime <- 0

  time <- df$time
  tae <- rep(0, nrow(df))

  for(i in 1:nrow(df)){

    if(event[i]){

      lasttime <- time[i]

    }else{

      tae[i] <- time[i] - lasttime

    }

  }

  df$tae <- tae

  return(df)
}

dplyrfun <- function(df){

  return(df %>%
    mutate(tmp = c(0, diff(time)) * !event,
           tmp2 = cumsum(c(FALSE, as.logical(diff(event))))) %>%
    group_by(tmp2) %>%
    mutate(tae = cumsum(tmp)) %>%
    select(-tmp, -tmp2)
  )

}

microbenchmark(loopfun(df), dplyrfun(df), times = 10000)

Unit: microseconds
         expr      min       lq       mean   median       uq      max neval
  loopfun(df)   57.356   70.035   95.89365   82.109   96.599 49001.19 10000
 dplyrfun(df) 1494.564 1625.274 1875.85263 1705.722 1877.336 50087.32 10000
jtanman

I can't think of a way to vectorize it right now, but here's a loop that should be decently quick (O(n)).

# flag the rows on which an event occurs
event <- (df$event == 100)
# time of the most recent event (0 before the first event is seen)
lasttime <- 0

time <- df$time
tae <- rep(0, nrow(df))

for(i in 1:nrow(df)){

    if(event[i]){

        # reset the reference time on an event row; tae stays 0
        lasttime <- time[i]

    }else{

        # otherwise record the time elapsed since the last event
        tae[i] <- time[i] - lasttime

    }

}

df$tae <- tae
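
A vectorized base-R sketch (an alternative, not part of the answer above) that gives the same tae as the loop when events are coded as positive values and every id begins with an event row, as in both example data sets:

# Carry the time of the most recent event forward within each id with cummax(),
# then subtract it from the current time; event rows get tae = 0.
df$tae <- with(df, time - ave(time * (event > 0), id, FUN = cummax))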
jtanman