3

I'm trying to count # consecutive days inactive (consecDaysInactive), per ID.

I have already created an indicator variable inactive that is 1 on days where id is inactive and 0 when active. I also have an id variable, and a date variable. My analysis dataset will have hundreds of thousands of rows, so efficiency will be important.

The logic I'm trying to create is as follows:

  • per id, if user is active, consecDaysInactive = 0
  • per id, if user is inactive, and was active on previous day, consecDaysInactive = 1
  • per id, if user is inactive on previous day, consecDaysInactive = 1 + # previous consecutive inactive days
  • consecDaysInactive should reset to 0 for new values of id.

I've been able to create a cumulative sum, but unable to get it to reset at 0 after >= rows of inactive==0.

I've illustrated below the result that I want (consecDaysInactive), as well as the result that I was able to achieve programmatically (bad_consecDaysInactive).

library(dplyr)
d <- data.frame(id = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2), date=as.Date(c('2017-01-01','2017-01-02','2017-01-03','2017-01-04','2017-01-05','2017-01-06','2017-01-07','2017-01-08','2017-01-01','2017-01-02','2017-01-03','2017-01-04','2017-01-05','2017-01-06','2017-01-07','2017-01-08')), inactive=c(0,0,0,1,1,1,0,1,0,1,1,1,1,0,0,1), consecDaysInactive=c(0,0,0,1,2,3,0,1,0,1,2,3,4,0,0,1))

d <- d %>% 
  group_by(id) %>% 
  arrange(id, date) %>% 
  do( data.frame(., bad_consecDaysInactive = cumsum(ifelse(.$inactive==1, 1,0))
  )
  )
d

where consecDaysInactive iterates by +1 for each consecutive day inactive, but resets to 0 each date user is active, and resets to 0 for new values of id. As the output shows below, I'm unable to get bad_consecDaysInactive to reset to 0 -- e.g. row

          id       date inactive consecDaysInactive bad_consecDaysInactive
       <dbl>     <date>    <dbl>              <dbl>                  <dbl>
    1      1 2017-01-01        0                  0                      0
    2      1 2017-01-02        0                  0                      0
    3      1 2017-01-03        0                  0                      0
    4      1 2017-01-04        1                  1                      1
    5      1 2017-01-05        1                  2                      2
    6      1 2017-01-06        1                  3                      3
    7      1 2017-01-07        0                  0                      3
    8      1 2017-01-08        1                  1                      4
    9      2 2017-01-01        0                  0                      0
    10     2 2017-01-02        1                  1                      1
    11     2 2017-01-03        1                  2                      2
    12     2 2017-01-04        1                  3                      3
    13     2 2017-01-05        1                  4                      4
    14     2 2017-01-06        0                  0                      4
    15     2 2017-01-07        0                  0                      4
    16     2 2017-01-08        1                  1                      5

I also considered (and tried) incrementing a variable within group_by() & do(), but since do() isn't iterative, I can't get my counter to get past 2:

d2 <- d %>%
  group_by(id) %>% 
  do( data.frame(., bad_consecDaysInactive2 = ifelse(.$inactive == 0, 0, ifelse(.$inactive==1,.$inactive+lag(.$inactive), .$inactive)))) 
d2 

which yielded, as described above:

      id       date inactive consecDaysInactive bad_consecDaysInactive bad_consecDaysInactive2
   <dbl>     <date>    <dbl>              <dbl>                  <dbl>                   <dbl>
1      1 2017-01-01        0                  0                      0                       0
2      1 2017-01-02        0                  0                      0                       0
3      1 2017-01-03        0                  0                      0                       0
4      1 2017-01-04        1                  1                      1                       1
5      1 2017-01-05        1                  2                      2                       2
6      1 2017-01-06        1                  3                      3                       2
7      1 2017-01-07        0                  0                      3                       0
8      1 2017-01-08        1                  1                      4                       1
9      2 2017-01-01        0                  0                      0                       0
10     2 2017-01-02        1                  1                      1                       1
11     2 2017-01-03        1                  2                      2                       2
12     2 2017-01-04        1                  3                      3                       2
13     2 2017-01-05        1                  4                      4                       2
14     2 2017-01-06        0                  0                      4                       0
15     2 2017-01-07        0                  0                      4                       0
16     2 2017-01-08        1                  1                      5                       1

As you can see, my iterator bad_consecDaysInactive2 resets at 0, but doesn't increment past 2! If there's a data.table solution, I'd be happy to hear it as well.

rsty
  • 33
  • 4
  • something like this? `library(data.table); setDT(d)[, consecDaysInactive2:=cumsum(inactive), by=.(id, cumsum(!inactive))]` – chinsoon12 Mar 23 '17 at 04:48
  • `library(data.table); setDT(d)[, v := if (inactive[1]) seq.int(.N) else 0L, by=rleid(inactive)]` – Frank Mar 23 '17 at 05:11
  • Thanks chinsoon12 and Frank -- these both worked well. I'll take this as an opportunity to explore the data.table library. @Frank, regarding your marking this post as duplicate, I thought this was different from the article you marked wherein OP asks for a way to use a data.table function in dplyr, with goal of making a constant that increments with distinct ID values. The operation I was attempting was different, and I wasn't asking for a data.table method in dplyr; dplyr was the method I tried but was not able to accomplish my goal, hence the question. Again, thanks for your help. – rsty Mar 23 '17 at 17:55
  • @rsty Np. Yeah, yours turns out to be a two-part problem (as I see it): the first is resolved by the link (how to group on runs of 1s); and the second is then fairly trivial (to count within those groups). So in dplyr `group_by(g = rleid(something)) %>% mutate(r = if (x[1]==1) row_number() else 0L)` or similar. If someone else wants to undupe it, that's fine, but I guess I'll leave it. – Frank Mar 23 '17 at 17:59

1 Answers1

2

Here's a cute way to do it with a for-loop:

a <- c(1,1,1,1,0,0,1,0,1,1,1,0,0)
b <- rep(NA, length(a))
b[1] <- a[1]
for(i in 2:length(a)){
  b[i] <- a[i]*(a[i]+b[i-1])
}
a
b

It may not be the most efficient way to do this, but it will be pretty darn fast. 11.7 seconds for ten million rows on my computer.

a <- round(runif(10000000,0,1))
b <- rep(NA, length(a))
b[1] <- a[1]
t <- Sys.time()
for(i in 2:length(a)){
  b[i] <- a[i]*(a[i]+b[i-1])
}
b
Sys.time()-t

Time difference of 11.73612 secs

But this doesn't account for the need to do things per id. That's easy to fix, at a minimal efficiency penalty. Your example dataframe is sorted by id. If you actual data are not already sorted, then do so. Then:

a <- round(runif(10000000,0,1))
id <- round(runif(10000000,1,1000))
id <- id[order(id)]
b <- rep(NA, length(a))
b[1] <- a[1]
t <- Sys.time()
for(i in 2:length(a)){
  b[i] <- a[i]*(a[i]+b[i-1])
  if(id[i] != id[i-1]){
    b[i] <- a[i]
  }
}
b
Sys.time()-t

Time difference of 13.54373 secs

If we include the time that it took to sort id, then the time difference is closer to 19 seconds. Still not too bad!

How much of an efficiency savings can we get using Frank's answer in the comments on the OP?

d <- data.frame(inactive=a, id=id)

t2 <- Sys.time()
b <- setDT(d)[, v := if (inactive[1]) seq.int(.N) else 0L, by=rleid(inactive)]
Sys.time()-t2

Time difference of 2.233547 secs

Jacob Socolar
  • 1,172
  • 1
  • 9
  • 23
  • Thanks Jacob! This worked with minor tweaks to put it in data-frame form. This is a nice for loop solution; my R is pretty rusty, this will be useful going forward. – rsty Mar 23 '17 at 17:41