I have a large dataset that includes date periods with different disease states per id and reference date. I would like to add a 'healthy' state for all missing date periods within +/- 5 years from the reference date per id.
I have tried to modify the solution here: Fill in missing date ranges but failed. Preferably, I would like to keep to the data.table framework. Any advice is greatly appreciated!
Sample data:
DT <- fread("
id reference_date period_start period_end Status
1 2010-01-10 2004-06-22 2005-03-15 1
1 2010-01-10 2008-10-11 2008-10-12 1
1 2010-01-10 2014-11-05 2016-01-03 2
2 2013-05-10 2012-02-01 2012-03-01 2
2 2014-06-11 2012-02-01 2012-03-01 2
3 2011-08-14 NA NA NA
")
Desired output:
DT <- fread("
id reference_date period_start period_end Status
1 2010-01-10 2004-06-22 2005-03-15 1
1 2010-01-10 2005-03-16 2008-10-10 0
1 2010-01-10 2008-10-11 2008-10-12 1
1 2010-01-10 2008-10-13 2014-11-04 0
1 2010-01-10 2014-11-05 2016-01-03 2
2 2013-05-10 2008-05-10 2012-01-31 0
2 2013-05-10 2012-02-01 2012-03-01 2
2 2013-05-10 2012-03-02 2018-05-10 0
2 2014-06-11 2009-06-11 2012-01-31 0
2 2014-06-11 2012-02-01 2012-03-01 2
2 2014-06-11 2012-03-02 2019-06-11 0
3 2011-08-14 2006-08-14 2016-08-14 0
")
Comment: For the first row, the +/-5 year date interval is from 2005-01-10 to 2015-01-10. However, because of the ongoing disease state that ends 2005-03-15, the "healthy" period starts at 2005-03-16. Because there can be several reference dates per id, duplicate date periods (as observed for id 2: 2012-02-01-2012-03-01) will be present and are OK. Finally, ids with no disease states are represented by NA (as id 3).
EDIT: I had some problems with the real data, so I tweaked the solution a bit; also added so that the status is collapsed per date interval:
DT2 <- DT[,{
# +/-5 years from t0
sdt <- seq(reference_date, by="-5 years", length.out=2L)[2L]
edt <- seq(reference_date, by="5 years", length.out=2L)[2L]
if(is.na(start[1L])) {
# replace NA with full time interval for 'healthy'
.(period_start=sdt, period_end=edt, status='notsick')
} else{
# Add date for -5 years if it is the minimum, otherwise use existing minimum
if (sdt < period_start[1L]) {
period_start <- c(sdt, period_start)
}
# Add date for +5 years if it is the maximum, otherwise use existing maximum
if (edt > period_end[.N]) {
period_end <- c(period_end,edt)
}
dates=unique(sort(c(period_start, period_end+1L)))
.(start=dates[-length(dates)],end=dates[-1L]-1,status='')
}
},
.(id,reference_date)]
## (c). Collapse status for overlapping periods
DT <- DT[DT2, on = .(id,reference_date, period_start <= period_start, period_end >= period_end), {
status <- paste(status, collapse = ";")
.(status=status)},
by = .EACHI, allow.cartesian = TRUE]