I'm trying to count # consecutive days inactive (consecDaysInactive
), per ID.
I have already created an indicator variable inactive
that is 1 on days where id is inactive and 0 when active. I also have an id variable, and a date variable. My analysis dataset will have hundreds of thousands of rows, so efficiency will be important.
The logic I'm trying to create is as follows:
- per id, if user is active,
consecDaysInactive
= 0 - per id, if user is inactive, and was active on previous day,
consecDaysInactive
= 1 - per id, if user is inactive on previous day,
consecDaysInactive
= 1 + # previous consecutive inactive days consecDaysInactive
should reset to 0 for new values of id.
I've been able to create a cumulative sum, but unable to get it to reset at 0 after >= rows of inactive==0.
I've illustrated below the result that I want (consecDaysInactive
), as well as the result that I was able to achieve programmatically (bad_consecDaysInactive
).
library(dplyr)
d <- data.frame(id = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2), date=as.Date(c('2017-01-01','2017-01-02','2017-01-03','2017-01-04','2017-01-05','2017-01-06','2017-01-07','2017-01-08','2017-01-01','2017-01-02','2017-01-03','2017-01-04','2017-01-05','2017-01-06','2017-01-07','2017-01-08')), inactive=c(0,0,0,1,1,1,0,1,0,1,1,1,1,0,0,1), consecDaysInactive=c(0,0,0,1,2,3,0,1,0,1,2,3,4,0,0,1))
d <- d %>%
group_by(id) %>%
arrange(id, date) %>%
do( data.frame(., bad_consecDaysInactive = cumsum(ifelse(.$inactive==1, 1,0))
)
)
d
where consecDaysInactive
iterates by +1 for each consecutive day inactive, but resets to 0 each date user is active, and resets to 0 for new values of id. As the output shows below, I'm unable to get bad_consecDaysInactive
to reset to 0 -- e.g. row
id date inactive consecDaysInactive bad_consecDaysInactive
<dbl> <date> <dbl> <dbl> <dbl>
1 1 2017-01-01 0 0 0
2 1 2017-01-02 0 0 0
3 1 2017-01-03 0 0 0
4 1 2017-01-04 1 1 1
5 1 2017-01-05 1 2 2
6 1 2017-01-06 1 3 3
7 1 2017-01-07 0 0 3
8 1 2017-01-08 1 1 4
9 2 2017-01-01 0 0 0
10 2 2017-01-02 1 1 1
11 2 2017-01-03 1 2 2
12 2 2017-01-04 1 3 3
13 2 2017-01-05 1 4 4
14 2 2017-01-06 0 0 4
15 2 2017-01-07 0 0 4
16 2 2017-01-08 1 1 5
I also considered (and tried) incrementing a variable within group_by()
& do()
, but since do()
isn't iterative, I can't get my counter to get past 2:
d2 <- d %>%
group_by(id) %>%
do( data.frame(., bad_consecDaysInactive2 = ifelse(.$inactive == 0, 0, ifelse(.$inactive==1,.$inactive+lag(.$inactive), .$inactive))))
d2
which yielded, as described above:
id date inactive consecDaysInactive bad_consecDaysInactive bad_consecDaysInactive2
<dbl> <date> <dbl> <dbl> <dbl> <dbl>
1 1 2017-01-01 0 0 0 0
2 1 2017-01-02 0 0 0 0
3 1 2017-01-03 0 0 0 0
4 1 2017-01-04 1 1 1 1
5 1 2017-01-05 1 2 2 2
6 1 2017-01-06 1 3 3 2
7 1 2017-01-07 0 0 3 0
8 1 2017-01-08 1 1 4 1
9 2 2017-01-01 0 0 0 0
10 2 2017-01-02 1 1 1 1
11 2 2017-01-03 1 2 2 2
12 2 2017-01-04 1 3 3 2
13 2 2017-01-05 1 4 4 2
14 2 2017-01-06 0 0 4 0
15 2 2017-01-07 0 0 4 0
16 2 2017-01-08 1 1 5 1
As you can see, my iterator bad_consecDaysInactive2
resets at 0, but doesn't increment past 2! If there's a data.table solution, I'd be happy to hear it as well.