2

I have a large dataset that includes date periods with different disease states per id and reference date. I would like to add a 'healthy' state for all missing date periods within +/- 5 years from the reference date per id.

I have tried to modify the solution here: Fill in missing date ranges but failed. Preferably, I would like to keep to the data.table framework. Any advice is greatly appreciated!

Sample data:

DT <- fread("
id  reference_date  period_start  period_end   Status
1   2010-01-10      2004-06-22    2005-03-15   1
1   2010-01-10      2008-10-11    2008-10-12   1
1   2010-01-10      2014-11-05    2016-01-03   2
2   2013-05-10      2012-02-01    2012-03-01   2
2   2014-06-11      2012-02-01    2012-03-01   2
3   2011-08-14      NA            NA           NA 
")

Desired output:

DT <- fread("
id  reference_date  period_start  period_end   Status
1   2010-01-10      2004-06-22    2005-03-15   1
1   2010-01-10      2005-03-16    2008-10-10   0   
1   2010-01-10      2008-10-11    2008-10-12   1
1   2010-01-10      2008-10-13    2014-11-04   0
1   2010-01-10      2014-11-05    2016-01-03   2
2   2013-05-10      2008-05-10    2012-01-31   0
2   2013-05-10      2012-02-01    2012-03-01   2
2   2013-05-10      2012-03-02    2018-05-10   0
2   2014-06-11      2009-06-11    2012-01-31   0
2   2014-06-11      2012-02-01    2012-03-01   2
2   2014-06-11      2012-03-02    2019-06-11   0
3   2011-08-14      2006-08-14    2016-08-14   0 
")

Comment: For the first row, the +/-5 year date interval is from 2005-01-10 to 2015-01-10. However, because of the ongoing disease state that ends 2005-03-15, the "healthy" period starts at 2005-03-16. Because there can be several reference dates per id, duplicate date periods (as observed for id 2: 2012-02-01-2012-03-01) will be present and are OK. Finally, ids with no disease states are represented by NA (as id 3).

EDIT: I had some problems with the real data, so I tweaked the solution a bit; also added so that the status is collapsed per date interval:

 DT2 <- DT[,{

        # +/-5 years from t0
        sdt <- seq(reference_date, by="-5 years", length.out=2L)[2L]
        edt <- seq(reference_date, by="5 years", length.out=2L)[2L]

        if(is.na(start[1L])) {
          # replace NA with full time interval for 'healthy'
          .(period_start=sdt, period_end=edt, status='notsick')
        } else{
          # Add date for -5 years if it is the minimum, otherwise use existing minimum
          if (sdt < period_start[1L]) {
            period_start <- c(sdt, period_start)
          }
          # Add date for +5 years if it is the maximum, otherwise use existing maximum
          if (edt > period_end[.N]) {
            period_end <- c(period_end,edt)
          }
          dates=unique(sort(c(period_start, period_end+1L)))
          .(start=dates[-length(dates)],end=dates[-1L]-1,status='')
        }
      },
      .(id,reference_date)]

      ## (c). Collapse status for overlapping periods
      DT <- DT[DT2, on = .(id,reference_date, period_start <= period_start, period_end >= period_end), {
        status <- paste(status, collapse = ";")
        .(status=status)},
        by = .EACHI, allow.cartesian = TRUE]
Rico
  • 69
  • 1
  • 6

1 Answers1

1

here is an option:

interweave <- function(x, y) c(rbind(x, y)) #see ref
ans <- DT[, {
        sdt <- seq(reference_date, by="-5 years", length.out=2L)[2L]
        edt <- seq(reference_date, by="5 years", length.out=2L)[2L]

        if(is.na(period_start[1L])) {
            .(period_start=sdt, period_end=edt, Status=0L)
        } else {    
            if (sdt < period_start[1L]) {
                period_start <- c(sdt, period_start)
            } 
            ps <- as.IDate(sort(interweave(period_start, period_end+1L)))

            if (period_end[.N] > edt) {
                ps <- ps[-length(ps)]
                pe <- period_end[.N]
            } else {
                pe <- edt
            }
            .(period_start=ps, period_end=c(ps[-1L] - 1, pe), Status=0L)
        }
    },
    .(id, reference_date)]
ans[DT, on=setdiff(names(DT), "Status"), Status := i.Status]
ans

data:

library(data.table)
DT <- fread("
id  reference_date  period_start  period_end   Status
1   2010-01-10      2004-06-22    2005-03-15   1
1   2010-01-10      2008-10-11    2008-10-12   1
1   2010-01-10      2014-11-05    2016-01-03   2
2   2013-05-10      2012-02-01    2012-03-01   2
2   2014-06-11      2012-02-01    2012-03-01   2
3   2011-08-14      NA            NA           NA 
")
cols <- c("reference_date","period_start","period_end")
DT[, (cols) := lapply(.SD, as.IDate, format="%Y-%m-%d"), .SDcols=cols]

Reference: Alternate, interweave or interlace two vectors

chinsoon12
  • 25,005
  • 4
  • 25
  • 35
  • 1
    Wonderful! Clear code and fast. I almost tore my hair off trying to solve it. Thank you – Rico Jan 22 '20 at 08:49