3

I have a dataset of messages sent to users, some have succeeded and some have failed:

> df.messages <- data.frame(date = c("2018-01-01 12:00","2018-01-01 12:00","2018-01-01 12:00","2018-01-02 12:00","2018-01-02 12:00","2018-01-02 12:00","2018-01-03 12:00","2018-01-03 12:00","2018-01-03 12:00","2018-01-04 12:00","2018-01-04 12:00","2018-01-04 12:00"), id = c(1,2,3,1,2,3,1,2,3,1,2,3), status = c("S","S","S","S","S","F","S","F","F","F","F","S"))
> df.messages
               date id status
1  2018-01-01 12:00  1      S
2  2018-01-01 12:00  2      S
3  2018-01-01 12:00  3      S
4  2018-01-02 12:00  1      S
5  2018-01-02 12:00  2      S
6  2018-01-02 12:00  3      F
7  2018-01-03 12:00  1      S
8  2018-01-03 12:00  2      F
9  2018-01-03 12:00  3      F
10 2018-01-04 12:00  1      F
11 2018-01-04 12:00  2      F
12 2018-01-04 12:00  3      S

Here's what to note:

  • One message is sent each day, over four days
  • id 1 succeeds (S) three times, then fails (F)
  • id 2 succeeds twice, then fails twice
  • id 3 succeeds once, then fails twice, then succeeds

I would like to break the users into four groups

  1. those that always succeeded
  2. those that failed, then succeeded later
  3. those that succeeded, then failed without ever succeeding again
  4. those that always failed

And then understand

  • the max number of times a user in group 2 failed before succeeding again
  • the max amount of time a user in group 2 failed before succeeding again
  • the max number of times a user in group 3 failed
  • the max amount of time a user in group 3 failed

The ideal output would be

  id group num_f_messages date_f_messages
1  1     3              1               1
2  2     3              2               2
3  3     2              2               2

I know I need to use rle() and diff(), but it's getting complicated and I haven't had to do this type of analysis before. I'm pretty lost.

I have 9MM rows, so I'm trying to accomplish this with data.table, but any solutions are welcome.

Edit:

I'm trying to extend this function to a larger dataset. So in a scenario where id 3's messages were "S,F,F,S,F,F,F,S", I need to reflect a maximum of 3 Fs before the final S.

Chris
  • 313
  • 1
  • 11

2 Answers2

1

You can try this:

require(plyr); require(dplyr)

df.messages %>% 
  group_by(id) %>% 
  summarise(group = ifelse(sum(status == "S") == n(), 1, 
                           ifelse(sum(status == "F") == n(), 4,
                                  ifelse(n_distinct(status) > 1 & 
                                           status[1] == "S" & status[n()] == "S", 2, 3))),
            num_f_messages = sum(status == "F"),
            date_f_messages = n_distinct(date[status == "F"]))

gives you:

# A tibble: 3 x 4
     id group num_f_messages date_f_messages
  <dbl> <dbl>          <int>           <int>
1     1     3              1               1
2     2     3              2               2
3     3     2              2               2
Codutie
  • 1,055
  • 13
  • 25
  • Thanks for this! I want to make sure I understand what's happening here - does this take into account the order in which the S and F messages come in, and the run length for S vs F? So, if a user had messages "S,F,S,F,F,F,S", would it detect that their longest run of "F" before an "S" was 3? It looks like it would give 4, since that would be sum(status == "F")? – Chris Jan 23 '18 at 18:20
  • Hey Chris.. sorry I haven‘t read that you need the max amount of succeding F, if you haven‘t found a solution I will adapt the code tomorrow! – Codutie Jan 23 '18 at 19:33
1

Here is a data.table solution.

library(data.table)
library(magrittr)

df.messages <- data.frame(date = c("2018-01-01 12:00","2018-01-01 12:00","2018-01-01 12:00","2018-01-02 12:00","2018-01-02 12:00","2018-01-02 12:00","2018-01-03 12:00","2018-01-03 12:00","2018-01-03 12:00","2018-01-04 12:00","2018-01-04 12:00","2018-01-04 12:00"), id = c(1,2,3,1,2,3,1,2,3,1,2,3), status = c("S","S","S","S","S","F","S","F","F","F","F","S"))
df.messages$status <- as.character(df.messages$status)
setDT(df.messages)


ans <- df.messages[,
            .(
                by_rle = paste0(rle(status)$value, collapse = ""),
                num_f_message = sum(status == "F"),
                date_f_message = length(unique(date[status == "F"]))
            ), 
            by = id] %>%
    # define groups and remove the by_rle columns
    .[by_rle == "S", group := 1] %>%
    .[by_rle == c("SFS"), group := 2] %>%
    .[by_rle == c("SF"), group := 3] %>%
    .[by_rle == "F", group := 4] %>%
    .[, by_rle := NULL] %>%
    setcolorder(c("id", "group", "num_f_message", "date_f_message"))

#    id group num_f_message date_f_message
# 1:  1     3             1              1
# 2:  2     3             2              2
# 3:  3     2             2              2
GL_Li
  • 1,758
  • 1
  • 11
  • 25
  • Thanks for this. I'm trying to extend this function to a larger dataset. So in a scenario where id 3's messages were "S,F,F,S,F,F,F,S", I need to reflect a maximum of 3 Fs before the final S. Is that easy to incorporate into what you've already written? – Chris Jan 23 '18 at 18:26
  • I was able to adapt from your example to my exact use, thanks GL_Li! Since my question was probably overly specific and your solution got me halfway there, I'll select this as the answer. – Chris Jan 24 '18 at 14:20