R - Finding max number of consecutive values by key

Question

I have a dataset of messages sent to users, some have succeeded and some have failed:

> df.messages <- data.frame(date = c("2018-01-01 12:00","2018-01-01 12:00","2018-01-01 12:00","2018-01-02 12:00","2018-01-02 12:00","2018-01-02 12:00","2018-01-03 12:00","2018-01-03 12:00","2018-01-03 12:00","2018-01-04 12:00","2018-01-04 12:00","2018-01-04 12:00"), id = c(1,2,3,1,2,3,1,2,3,1,2,3), status = c("S","S","S","S","S","F","S","F","F","F","F","S"))
> df.messages
               date id status
1  2018-01-01 12:00  1      S
2  2018-01-01 12:00  2      S
3  2018-01-01 12:00  3      S
4  2018-01-02 12:00  1      S
5  2018-01-02 12:00  2      S
6  2018-01-02 12:00  3      F
7  2018-01-03 12:00  1      S
8  2018-01-03 12:00  2      F
9  2018-01-03 12:00  3      F
10 2018-01-04 12:00  1      F
11 2018-01-04 12:00  2      F
12 2018-01-04 12:00  3      S

Here's what to note:

One message is sent each day, over four days
id 1 succeeds (S) three times, then fails (F)
id 2 succeeds twice, then fails twice
id 3 succeeds once, then fails twice, then succeeds

I would like to break the users into four groups

those that always succeeded
those that failed, then succeeded later
those that succeeded, then failed without ever succeeding again
those that always failed

And then understand

the max number of times a user in group 2 failed before succeeding again
the max amount of time a user in group 2 failed before succeeding again
the max number of times a user in group 3 failed
the max amount of time a user in group 3 failed

The ideal output would be

  id group num_f_messages date_f_messages
1  1     3              1               1
2  2     3              2               2
3  3     2              2               2

I know I need to use rle() and diff(), but it's getting complicated and I haven't had to do this type of analysis before. I'm pretty lost.

I have 9MM rows, so I'm trying to accomplish this with data.table, but any solutions are welcome.

Edit:

I'm trying to extend this function to a larger dataset. So in a scenario where id 3's messages were "S,F,F,S,F,F,F,S", I need to reflect a maximum of 3 Fs before the final S.

What group for those who failed first, and succeded a lest one time after ? — MrSmithGoesToWashington, Jan 23 '18 at 13:51
That's a good question - group 2 should actually be "those that failed, then succeeded". I will fix. — Chris, Jan 23 '18 at 16:06

score 1 · Answer 1 · answered Jan 23 '18 at 14:02

1

You can try this:

require(plyr); require(dplyr)

df.messages %>% 
  group_by(id) %>% 
  summarise(group = ifelse(sum(status == "S") == n(), 1, 
                           ifelse(sum(status == "F") == n(), 4,
                                  ifelse(n_distinct(status) > 1 & 
                                           status[1] == "S" & status[n()] == "S", 2, 3))),
            num_f_messages = sum(status == "F"),
            date_f_messages = n_distinct(date[status == "F"]))

gives you:

# A tibble: 3 x 4
     id group num_f_messages date_f_messages
  <dbl> <dbl>          <int>           <int>
1     1     3              1               1
2     2     3              2               2
3     3     2              2               2

answered Jan 23 '18 at 14:02

Codutie

1,055
13
25

Thanks for this! I want to make sure I understand what's happening here - does this take into account the order in which the S and F messages come in, and the run length for S vs F? So, if a user had messages "S,F,S,F,F,F,S", would it detect that their longest run of "F" before an "S" was 3? It looks like it would give 4, since that would be sum(status == "F")? – Chris Jan 23 '18 at 18:20
Hey Chris.. sorry I haven‘t read that you need the max amount of succeding F, if you haven‘t found a solution I will adapt the code tomorrow! – Codutie Jan 23 '18 at 19:33

score 1 · Accepted Answer · answered Jan 23 '18 at 17:20

Here is a data.table solution.

library(data.table)
library(magrittr)

df.messages <- data.frame(date = c("2018-01-01 12:00","2018-01-01 12:00","2018-01-01 12:00","2018-01-02 12:00","2018-01-02 12:00","2018-01-02 12:00","2018-01-03 12:00","2018-01-03 12:00","2018-01-03 12:00","2018-01-04 12:00","2018-01-04 12:00","2018-01-04 12:00"), id = c(1,2,3,1,2,3,1,2,3,1,2,3), status = c("S","S","S","S","S","F","S","F","F","F","F","S"))
df.messages$status <- as.character(df.messages$status)
setDT(df.messages)


ans <- df.messages[,
            .(
                by_rle = paste0(rle(status)$value, collapse = ""),
                num_f_message = sum(status == "F"),
                date_f_message = length(unique(date[status == "F"]))
            ), 
            by = id] %>%
    # define groups and remove the by_rle columns
    .[by_rle == "S", group := 1] %>%
    .[by_rle == c("SFS"), group := 2] %>%
    .[by_rle == c("SF"), group := 3] %>%
    .[by_rle == "F", group := 4] %>%
    .[, by_rle := NULL] %>%
    setcolorder(c("id", "group", "num_f_message", "date_f_message"))

#    id group num_f_message date_f_message
# 1:  1     3             1              1
# 2:  2     3             2              2
# 3:  3     2             2              2

Thanks for this. I'm trying to extend this function to a larger dataset. So in a scenario where id 3's messages were "S,F,F,S,F,F,F,S", I need to reflect a maximum of 3 Fs before the final S. Is that easy to incorporate into what you've already written? — Chris, Jan 23 '18 at 18:26
I was able to adapt from your example to my exact use, thanks GL_Li! Since my question was probably overly specific and your solution got me halfway there, I'll select this as the answer. — Chris, Jan 24 '18 at 14:20

R - Finding max number of consecutive values by key

2 Answers2