Identify consecutive sequences based on a given variable

Question

I am literally stuck on this. The df1 has the following variables:

serial = Group of people
id1 = the person from the group (eg. 12 (serial) 1 (id1) =group 12 person 1; 12 2 = group 12 person 2, etc. )
'Day'when the first (or start) recording was made.

The days consist of equal number of observations (eg.95)

        day1 (Monday)  =  day11-day196 
        day2 (Tuesday) = day21-day296     
        day3 (Wednesday) =  day31-day396   
        day4 (Thursday) =  day41-day496   
        day5 (Friday) = day51-day596      
        day6 (Saturday) = day61-day696   
        day7 (Sunday) =  day71-day796

Example of df1

serial id1  Day     day1 day2 day3 day4 day5 day6 day7
12      1   Monday    2    1    2    1    1    3    1
123     1   Tuesday   0    3    0    3    3    0    3
10      1   Wednesday 0    3    3    3    3    3    3

I would like to identify the consecutive records (there is no gap between the daily records) and the total amount of the records.

The starting day for consecutive recordings is the 'Day` variable. For example a consecutive record would be serial 12. Recording started on Monday and there are records (at leas one from 95 variable) during the week. During the week (7 x 95 variable) there were made 11 records

A non-consecutive record would be id 123 as the there is a gap day on day3 and day6. Record started on Tuesday and there is a gap on Wednesday and Saturday.

Finally I would like to record the duration of the consecutive recording.

Sample output:

 serial  id1   Duration Occurance        Days
12       1      11        7        day1 day2 day3 day4 day5 day6 day7
123      1      12        0        0
10       1      18        5        day3 day4 day5 day6 day7

Sample data

structure(list(serial = c(12, 123, 10), id1 = c(1, 1, 1), Day = structure(1:3, .Label = c("Monday",
"Tuesday", "Wednesday"), class = "factor"), day1 = c(2, 0, 0),
day2 = c(1, 3, 3), day3 = c(2, 0, 3), day4 = c(1, 3, 3),
day5 = c(1, 3, 3), day6 = c(3, 0, 3), day7 = c(1, 3, 3)), row.names = c(NA,
3L), class = "data.frame")

Similar post R - identify consecutive sequences

If there are columns until day796, I just wonder, how many values you would consider as consescutive — akrun, Apr 13 '20 at 21:50
Consider that a row is `2 1 0 1 0 5 3 4 3 2 1 7 5 0 1 3 2 4 3 5 7 9 0 1 3 4 2 4 2 7 9..` — akrun, Apr 13 '20 at 21:52
the confusion is that you have a mappiing table for days that start from day11 and your iinput example is less than that — akrun, Apr 13 '20 at 21:56

score 1 · Accepted Answer · answered Apr 13 '20 at 23:20

We can use rleid from data.table to get the 'Occurance' correct

library(data.table)
wkdays <- c("Monday", "Tuesday", "Wednesday", "Thursday", 
"Friday", "Saturday", "Sunday")

out1 <-  do.call(rbind, Map(function(x, y) {
              i1 <- match(y, wkdays): length(x)
              i2 <- x[i1] != 0
              i3 <- all(i2)
              grp1 <- rleid(i2)
              Days <- if(i3) tapply(names(x)[i1][i2], grp1[i2], FUN = paste, collapse= ' ') else ''
             Occurance <- if(i3) length(grp1[i2]) else 0
             data.frame(Occurance, Days)
            }, asplit(df[-(1:3)], 1), df$Day))

 out1$Duration <- rowSums(df1[startsWith(names(df1), 'day')])
 out1
 # Occurance                               Days Duration
 #1         7 day1 day2 day3 day4 day5 day6 day7       11
 #2         0                                          12
 #3         5           day3 day4 day5 day6 day7       18

Thomas Jc · Answer 2 · 2020-04-13T13:56:02.100

0

You can make use of lead and lag of dplyr,

I tried it on my side and here is the result:

library(dplyr)

df %>% 
    select(serial, contains("day", ignore.case = FALSE)) %>% 
    group_by(serial) %>% 
    tidyr::gather(day, val, -serial) %>% 
    # convert to binary 
    mutate(occur = ifelse(val > 0, 1, 0)) %>% 
    # if detect a seq, add cumulative, else 0
    mutate(cums = ifelse(lead(occur) > 0 & lag(occur) > 0 & occur > 0, 
                         occur + cumsum(occur), 0)) %>% 
    summarise(occurance = max(cums, na.rm = T), duration = sum(val))

# A tibble: 3 x 3
  serial occurance duration
   <dbl>     <dbl>    <dbl>
1     10         6       18
2     12         7       11
3    123         0       12

edited Apr 13 '20 at 13:56

answered Apr 13 '20 at 12:29

Thomas Jc

137
2

many thanks maybe this is a big ask but how can I return the sequence days? – Rstudent Apr 13 '20 at 14:03
1

You can find out which row the maximum "occurrence" is at (using `which`) and works backwards to retrieve the respective rows. This is sort of a hack-ish way of doing it. Since you are fundamentally dealing with sequences, I recommend you to convert your data into time-series format and use relevant packages. – Thomas Jc Apr 13 '20 at 14:37
one more question please how can I group the results based on id so as in sample output. Basically instead of... summarise(occurance = max(cums, na.rm = T), duration = sum(val))...how can I receive an output wit id (grouped) occurance and duration. thank you for your time – Rstudent Apr 13 '20 at 15:32
do you mean that you want to group both with serial and id? Just do it directly `group_by(serial, id)`. – Thomas Jc Apr 13 '20 at 15:37
yes I do need both as serial and id1 define an individual observation - i think we need to create an index for this – Rstudent Apr 13 '20 at 15:39
but how can i see the result ordered based on (serial id1) occurance and duration ..at this moment this line ..mutate(cums = ifelse(lead(occur) > 0 & lag(occur) > 0 & occur > 0, occur + cumsum(occur), 0))...the order is done based on day – Rstudent Apr 13 '20 at 15:40

Identify consecutive sequences based on a given variable

2 Answers2

Linked