3

I'm trying to build a churn model that includes the maximum consecutive number of UX failures for each customer and having trouble. Here's my simplified data and desired output:

library(dplyr)
df <- data.frame(customerId = c(1,2,2,3,3,3), date = c('2015-01-01','2015-02-01','2015-02-02', '2015-03-01','2015-03-02','2015-03-03'),isFailure = c(0,0,1,0,1,1))
> df
  customerId       date isFailure
1          1 2015-01-01         0
2          2 2015-02-01         0
3          2 2015-02-02         1
4          3 2015-03-01         0
5          3 2015-03-02         1
6          3 2015-03-03         1

desired results:

> desired.df
  customerId maxConsecutiveFailures
1          1                      0
2          2                      1
3          3                      2

I'm flailing quite a bit and searching through other rle questions isn't helping me yet - this is what I was "expecting" a solution to resemble:

df %>% 
  group_by(customerId) %>%
  summarise(maxConsecutiveFailures = 
    max(rle(isFailure[isFailure == 1])$lengths))
Jack Case
  • 329
  • 2
  • 13
  • A base R option is `sapply(split(df$isFailure, df$customerId), function(x) {tmp <- with(rle(x==1), lengths[values]); if(length(tmp)==0) 0 else tmp})` – akrun Aug 23 '15 at 07:11
  • Another option with `data.table` would be `setDT(df)[, {tmp <- rleid(isFailure)*isFailure; tmp2 <- table(tmp[.N==1|tmp!=0]); max((names(tmp2)!=0)*tmp2)}, customerId][]` – akrun Aug 23 '15 at 11:12

2 Answers2

4

We group by the 'customerId' and use do to perform the rle on 'isFailure' column. Extract the lengths that are 'TRUE' for values (lengths[values]), and create the 'Max' column with an if/else condition to return 0 for those that didn't have any 1 value.

 df %>%
    group_by(customerId) %>%
    do({tmp <- with(rle(.$isFailure==1), lengths[values])
     data.frame(customerId= .$customerId, Max=if(length(tmp)==0) 0 
                    else max(tmp)) }) %>% 
     slice(1L)
#   customerId Max
#1          1   0
#2          2   1
#3          3   2
akrun
  • 874,273
  • 37
  • 540
  • 662
0

Here is my try, only using standard dplyr functions:

df %>% 
  # grouping key(s):
  group_by(customerId) %>%
  # check if there is any value change
  # if yes, a new sequence id is generated through cumsum
  mutate(last_one = lag(isFailure, 1, default = 100), 
         not_eq = last_one != isFailure, 
         seq = cumsum(not_eq)) %>% 
  # the following is just to find the largest sequence
  count(customerId, isFailure, seq) %>% 
  group_by(customerId, isFailure) %>% 
  summarise(max_consecutive_event = max(n))

Output:

# A tibble: 5 x 3
# Groups:   customerId [3]
  customerId isFailure max_consecutive_event
       <dbl>     <dbl>                 <int>
1          1         0                     1
2          2         0                     1
3          2         1                     1
4          3         0                     1
5          3         1                     2

A final filter on isFailure value would yield the wanted result (need to add back 0 failure count customers though).

The script can take any values of isFailure column and count the maximum consecutive days of having the same value.