Summarize consecutive failures with dplyr and rle

Question

I'm trying to build a churn model that includes the maximum consecutive number of UX failures for each customer and having trouble. Here's my simplified data and desired output:

library(dplyr)
df <- data.frame(customerId = c(1,2,2,3,3,3), date = c('2015-01-01','2015-02-01','2015-02-02', '2015-03-01','2015-03-02','2015-03-03'),isFailure = c(0,0,1,0,1,1))
> df
  customerId       date isFailure
1          1 2015-01-01         0
2          2 2015-02-01         0
3          2 2015-02-02         1
4          3 2015-03-01         0
5          3 2015-03-02         1
6          3 2015-03-03         1

desired results:

> desired.df
  customerId maxConsecutiveFailures
1          1                      0
2          2                      1
3          3                      2

I'm flailing quite a bit and searching through other rle questions isn't helping me yet - this is what I was "expecting" a solution to resemble:

df %>% 
  group_by(customerId) %>%
  summarise(maxConsecutiveFailures = 
    max(rle(isFailure[isFailure == 1])$lengths))

A base R option is `sapply(split(df$isFailure, df$customerId), function(x) {tmp <- with(rle(x==1), lengths[values]); if(length(tmp)==0) 0 else tmp})` — akrun, Aug 23 '15 at 07:11
Another option with `data.table` would be `setDT(df)[, {tmp <- rleid(isFailure)*isFailure; tmp2 <- table(tmp[.N==1|tmp!=0]); max((names(tmp2)!=0)*tmp2)}, customerId][]` — akrun, Aug 23 '15 at 11:12

akrun · Accepted Answer · 2015-08-23T10:42:13.027

We group by the 'customerId' and use do to perform the rle on 'isFailure' column. Extract the lengths that are 'TRUE' for values (lengths[values]), and create the 'Max' column with an if/else condition to return 0 for those that didn't have any 1 value.

 df %>%
    group_by(customerId) %>%
    do({tmp <- with(rle(.$isFailure==1), lengths[values])
     data.frame(customerId= .$customerId, Max=if(length(tmp)==0) 0 
                    else max(tmp)) }) %>% 
     slice(1L)
#   customerId Max
#1          1   0
#2          2   1
#3          3   2

score 0 · Answer 2 · answered Dec 10 '19 at 20:42

Here is my try, only using standard dplyr functions:

df %>% 
  # grouping key(s):
  group_by(customerId) %>%
  # check if there is any value change
  # if yes, a new sequence id is generated through cumsum
  mutate(last_one = lag(isFailure, 1, default = 100), 
         not_eq = last_one != isFailure, 
         seq = cumsum(not_eq)) %>% 
  # the following is just to find the largest sequence
  count(customerId, isFailure, seq) %>% 
  group_by(customerId, isFailure) %>% 
  summarise(max_consecutive_event = max(n))

Output:

# A tibble: 5 x 3
# Groups:   customerId [3]
  customerId isFailure max_consecutive_event
       <dbl>     <dbl>                 <int>
1          1         0                     1
2          2         0                     1
3          2         1                     1
4          3         0                     1
5          3         1                     2

A final filter on isFailure value would yield the wanted result (need to add back 0 failure count customers though).

The script can take any values of isFailure column and count the maximum consecutive days of having the same value.

Summarize consecutive failures with dplyr and rle

2 Answers2

Linked