I have a dataset of messages sent to users, some have succeeded and some have failed:
> df.messages <- data.frame(date = c("2018-01-01 12:00","2018-01-01 12:00","2018-01-01 12:00","2018-01-02 12:00","2018-01-02 12:00","2018-01-02 12:00","2018-01-03 12:00","2018-01-03 12:00","2018-01-03 12:00","2018-01-04 12:00","2018-01-04 12:00","2018-01-04 12:00"), id = c(1,2,3,1,2,3,1,2,3,1,2,3), status = c("S","S","S","S","S","F","S","F","F","F","F","S"))
> df.messages
date id status
1 2018-01-01 12:00 1 S
2 2018-01-01 12:00 2 S
3 2018-01-01 12:00 3 S
4 2018-01-02 12:00 1 S
5 2018-01-02 12:00 2 S
6 2018-01-02 12:00 3 F
7 2018-01-03 12:00 1 S
8 2018-01-03 12:00 2 F
9 2018-01-03 12:00 3 F
10 2018-01-04 12:00 1 F
11 2018-01-04 12:00 2 F
12 2018-01-04 12:00 3 S
Here's what to note:
- One message is sent each day, over four days
- id 1 succeeds (S) three times, then fails (F)
- id 2 succeeds twice, then fails twice
- id 3 succeeds once, then fails twice, then succeeds
I would like to break the users into four groups
- those that always succeeded
- those that failed, then succeeded later
- those that succeeded, then failed without ever succeeding again
- those that always failed
And then understand
- the max number of times a user in group 2 failed before succeeding again
- the max amount of time a user in group 2 failed before succeeding again
- the max number of times a user in group 3 failed
- the max amount of time a user in group 3 failed
The ideal output would be
id group num_f_messages date_f_messages
1 1 3 1 1
2 2 3 2 2
3 3 2 2 2
I know I need to use rle()
and diff()
, but it's getting complicated and I haven't had to do this type of analysis before. I'm pretty lost.
I have 9MM rows, so I'm trying to accomplish this with data.table, but any solutions are welcome.
Edit:
I'm trying to extend this function to a larger dataset. So in a scenario where id 3's messages were "S,F,F,S,F,F,F,S", I need to reflect a maximum of 3 Fs before the final S.