2

I need some help working with consecutive results.

Here is my sample data:

df <- structure(list(idno = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 
2, 2, 2), result = structure(c(1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L), .Label = c("Negative", "Positive"
), class = c("ordered", "factor")), samp_date = structure(c(15909, 
15938, 15979, 16007, 16041, 16080, 16182, 16504, 16576, 16645, 
16721, 16745, 17105, 17281, 17416, 17429), class = "Date")), class = "data.frame", row.names = c(NA, 
-16L))

The 'idno' represents individual people who had a test with 'result' on a given date ('samp_date').

From each individual person, I need to find the earliest consecutive 'Negatives' and return the date of the first 'negative' result. To return this date, the consecutive negatives must span >30 days with no 'positive' results.

The example answer for idno == 1 would be 2013-10-29, and 2015-11-06 for idno == 2.

I have tried using rle(as.character(df$result)) but have struggled to understand how to apply this to grouped data.

I would prefer an approach that uses dplyr or data.table.

Thanks for any help.

Jay Achar
  • 1,156
  • 9
  • 16

3 Answers3

2

Similar to @MKR's answer, you can make a grouping variable and summarize in data.table:

library(data.table)
setDT(df)[, samp_date := as.IDate(samp_date)]

# summarize by grouping var g = rleid(idno, result)    
runDT = df[, .(
  start = first(samp_date),
  end  = last(samp_date),
  dur  = difftime(last(samp_date), first(samp_date), units="days")
), by=.(idno, result, g = rleid(idno, result))]

#    idno   result g      start        end      dur
# 1:    1 Negative 1 2013-07-23 2013-07-23   0 days
# 2:    1 Positive 2 2013-08-21 2013-10-01  41 days
# 3:    1 Negative 3 2013-10-29 2015-07-29 638 days
# 4:    2 Positive 4 2015-10-13 2015-10-13   0 days
# 5:    2 Negative 5 2015-11-06 2016-10-31 360 days
# 6:    2 Positive 6 2017-04-25 2017-09-20 148 days

# find rows meeting the criterion
w = runDT[.(idno = unique(idno), result = "Negative", min_dur = 30), 
  on=.(idno, result, dur >= min_dur), mult="first", which=TRUE]

# filter
runDT[w]

#    idno   result g      start        end      dur
# 1:    1 Negative 3 2013-10-29 2015-07-29 638 days
# 2:    2 Negative 5 2015-11-06 2016-10-31 360 days
Frank
  • 66,179
  • 8
  • 96
  • 180
  • I'd prefer to just do the join rather than using `w`, but the contents of `dur` would be filled with `min_dur`, which is not ideal... Not sure which issue this is related to, maybe https://github.com/Rdatatable/data.table/issues/1615 – Frank Jul 29 '18 at 14:21
  • 1
    Thanks for the response Frank - I've gone with the dplyr version simply because I find DT more difficult to understand. – Jay Achar Jul 29 '18 at 14:25
  • 1
    @Frank Use of rleid makes it easier in data.table. – MKR Jul 29 '18 at 14:38
  • @MKR You could edit your post to illustrate rleid here too (like https://stackoverflow.com/questions/33507868/is-there-a-dplyr-equivalent-to-data-tablerleid ). OP could use that function without needing the other data.table syntax. – Frank Jul 29 '18 at 15:17
  • 1
    Thanks @Frank. I just preferred to have only one package loaded. Moreover using `cumsum` based approach to get something similar to `rleid` seems easier and intuitive to me. – MKR Jul 29 '18 at 18:24
1

A dplyr based solution can be achieved by creating a group of consecutive occurrence of result column and then finally taking 1st occurrence that meets criteria:

library(dplyr)
df %>% mutate(samp_date = as.Date(samp_date)) %>% 
  group_by(idno) %>%
  arrange(samp_date) %>%
  mutate(result_grp = cumsum(as.character(result)!=lag(as.character(result),default=""))) %>%
  group_by(idno, result_grp) %>%
  filter( result == "Negative" & (max(samp_date) - min(samp_date) )>=30) %>%
  slice(1) %>%
  ungroup() %>%
  select(-result_grp) 

# # A tibble: 2 x 3
# idno result   samp_date 
# <dbl> <ord>    <date>    
# 1  1.00 Negative 2013-10-29
# 2  2.00 Negative 2015-11-06
MKR
  • 19,739
  • 4
  • 23
  • 33
0
library(dplyr)
df %>% group_by(idno) %>% 
       mutate(time_diff = ifelse(result=="Negative" & lead(result)=='Negative', samp_date - lead(samp_date),0), 
              ConsNegDate = min(samp_date[which(abs(time_diff)>30)]))


  # A tibble: 16 x 5
  # Groups:   idno [2]
       idno result   samp_date  time_diff ConsNegDate
      <dbl> <ord>    <date>         <dbl> <date>     
   1     1 Negative 2013-07-23         0 2013-10-29 
   2     1 Positive 2013-08-21         0 2013-10-29 
   3     1 Positive 2013-10-01         0 2013-10-29 
   4     1 Negative 2013-10-29       -34 2013-10-29 
   5     1 Negative 2013-12-02       -39 2013-10-29 
   6     1 Negative 2014-01-10      -102 2013-10-29 
   7     1 Negative 2014-04-22      -322 2013-10-29 
   8     1 Negative 2015-03-10       -72 2013-10-29 
   9     1 Negative 2015-05-21       -69 2013-10-29 
  10     1 Negative 2015-07-29        NA 2013-10-29 
  11     2 Positive 2015-10-13         0 2015-11-06 
  12     2 Negative 2015-11-06      -360 2015-11-06 
  13     2 Negative 2016-10-31         0 2015-11-06 
  14     2 Positive 2017-04-25         0 2015-11-06 
  15     2 Positive 2017-09-07         0 2015-11-06 
  16     2 Positive 2017-09-20         0 2015-11-06 
A. Suliman
  • 12,923
  • 5
  • 24
  • 37