Basically, using tmerge() to create data for time-varying-covariate Cox regression, two ways of expressing times should give the same regression results (I think), but they don't.
One way uses start and end dates, and converts to numeric within Surv(); the other just uses numeric days to event.
Example
First, create some data. We have an ID, an outcome (death), a start date for each row, and an end date some time later. The start date and end date are Date objects.
library(survival)  # provides tmerge(), coxph() and Surv()
n <- 1000
set.seed(0)
dd <- data.frame(id=1:n,
                 death=sample(x=c(FALSE, TRUE), prob=c(8, 1), size=n, replace=TRUE),
                 startDate=as.Date(runif(min=as.numeric(as.Date("2000-01-01")),
                                         max=as.numeric(as.Date("2019-12-31")), n=n),
                                   origin="1970-01-01"))
dd$endDate <- as.Date(as.numeric(dd$startDate) +
                        rnorm(mean=3650, sd=500, n=n), origin="1970-01-01")
# (You can check that endDate is never before startDate.)
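The check mentioned in that comment could look something like the sketch below. It rebuilds the same data so that it runs on its own; `followUp` and `noneReversed` are names made up for this illustration, not part of the original example.

```r
# Recreate the example data (same code as above; base R only).
n <- 1000
set.seed(0)
dd <- data.frame(id=1:n,
                 death=sample(x=c(FALSE, TRUE), prob=c(8, 1), size=n, replace=TRUE),
                 startDate=as.Date(runif(min=as.numeric(as.Date("2000-01-01")),
                                         max=as.numeric(as.Date("2019-12-31")), n=n),
                                   origin="1970-01-01"))
dd$endDate <- as.Date(as.numeric(dd$startDate) +
                        rnorm(mean=3650, sd=500, n=n), origin="1970-01-01")

# Follow-up length in days for each person (hypothetical helper variable).
followUp <- as.numeric(dd$endDate) - as.numeric(dd$startDate)

# TRUE if no end date falls on or before its start date.
noneReversed <- all(followUp > 0)
noneReversed
summary(followUp)
```

Note that the Date values built this way carry fractional days, even though printing them shows only the calendar date.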
Rather than a start and end date for each participant, we could alternatively start each person's clock at zero and record the number of days until event or censoring:
dd$startDay <- 0
dd$endDay <- as.numeric(dd$endDate - dd$startDate)
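As a sanity check on this conversion, the sketch below confirms that the day scale is just the date scale shifted by each person's own start date. It rebuilds the data so it can be run by itself; `shift` and `roundTripOk` are hypothetical names introduced only for this illustration.

```r
# Recreate the example data (same code as above; base R only).
n <- 1000
set.seed(0)
dd <- data.frame(id=1:n,
                 death=sample(x=c(FALSE, TRUE), prob=c(8, 1), size=n, replace=TRUE),
                 startDate=as.Date(runif(min=as.numeric(as.Date("2000-01-01")),
                                         max=as.numeric(as.Date("2019-12-31")), n=n),
                                   origin="1970-01-01"))
dd$endDate <- as.Date(as.numeric(dd$startDate) +
                        rnorm(mean=3650, sd=500, n=n), origin="1970-01-01")
dd$startDay <- 0
dd$endDay <- as.numeric(dd$endDate - dd$startDate)

# Per-person offset between the day scale and the calendar (date) scale.
shift <- as.numeric(dd$startDate) - dd$startDay

# Adding the offset back to endDay should recover the numeric end date.
roundTripOk <- isTRUE(all.equal(dd$endDay + shift, as.numeric(dd$endDate)))
roundTripOk

# Number of distinct offsets: the shift is not the same for everyone.
length(unique(shift))
```

So the two representations hold the same interval lengths, but the intervals sit at different positions on the time axis for different people.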
Next, we use tmerge() to transform the data into the format that would be needed for Cox regression with time-varying covariates. (Note: this is a minimal example that does not actually have any time-varying covariates.)
We do this two ways, to compare. 1) Using numeric days to event/censor; 2) Using dates.
Using days
ddTv <- tmerge(data1=dd, data2=dd, id=id,
tstart=startDay, tstop=endDay, event=event(endDay, death))
ddTv[1:6, ]
id death startDate endDate startDay endDay tstart tstop event
1 TRUE 2005-04-23 2014-11-29 0 3506.6 0 3506.6 TRUE
2 FALSE 2010-08-13 2023-02-16 0 4570.6 0 4570.6 FALSE
3 FALSE 2013-09-11 2023-06-22 0 3571.6 0 3571.6 FALSE
4 FALSE 2007-08-31 2015-10-03 0 2955.1 0 2955.1 FALSE
5 TRUE 2019-02-05 2027-01-27 0 2913.4 0 2913.4 TRUE
6 FALSE 2002-05-14 2012-04-06 0 3615.2 0 3615.2 FALSE
Using dates
ddTvDate <- tmerge(data1=dd, data2=dd, id=id,
tstart=startDate, tstop=endDate, event=event(endDate, death))
ddTvDate[1:6, ]
id death startDate endDate startDay endDay tstart tstop event
1 TRUE 2005-04-23 2014-11-29 0 3506.6 2005-04-23 2014-11-29 TRUE
2 FALSE 2010-08-13 2023-02-16 0 4570.6 2010-08-13 2023-02-16 FALSE
3 FALSE 2013-09-11 2023-06-22 0 3571.6 2013-09-11 2023-06-22 FALSE
4 FALSE 2007-08-31 2015-10-03 0 2955.1 2007-08-31 2015-10-03 FALSE
5 TRUE 2019-02-05 2027-01-27 0 2913.4 2019-02-05 2027-01-27 TRUE
6 FALSE 2002-05-14 2012-04-06 0 3615.2 2002-05-14 2012-04-06 FALSE
Finally, these two ways of expressing the same data don't give the same regression results. We'll compare just the null model:
Using days
ddMod <- coxph(formula=
Surv(time=tstart, time2=tstop, event=death) ~ 1,
data=ddTv)
ddMod
Null model
log likelihood= -702.08
n= 1000
Using dates
ddModDate <- coxph(formula=
Surv(time=as.numeric(tstart), time2=as.numeric(tstop), event=death) ~ 1,
data=ddTvDate)
ddModDate
Null model
log likelihood= -681.85
n= 1000
Log-likelihoods are similar, but not the same.
Why are these not the same?
If you add covariates to the model then coefficients and p values between the two versions are again not the same.
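One way to probe the discrepancy without tmerge() or coxph() is to compute the null-model log partial likelihood by hand on each time scale. The sketch below is not taken from the survival package: `nullLogLik` is a hypothetical helper, and it assumes there are no tied event times and that a person counts as at risk at time t when start < t <= stop.

```r
# Recreate the example data (same code as above; base R only).
n <- 1000
set.seed(0)
dd <- data.frame(id=1:n,
                 death=sample(x=c(FALSE, TRUE), prob=c(8, 1), size=n, replace=TRUE),
                 startDate=as.Date(runif(min=as.numeric(as.Date("2000-01-01")),
                                         max=as.numeric(as.Date("2019-12-31")), n=n),
                                   origin="1970-01-01"))
dd$endDate <- as.Date(as.numeric(dd$startDate) +
                        rnorm(mean=3650, sd=500, n=n), origin="1970-01-01")
dd$startDay <- 0
dd$endDay <- as.numeric(dd$endDate - dd$startDate)

# Hypothetical helper: null-model log partial likelihood for (start, stop]
# intervals, assuming no tied event times.
nullLogLik <- function(start, stop, event) {
  # For each event time, count the people whose interval covers that time.
  atRisk <- sapply(stop[event], function(t) sum(start < t & t <= stop))
  -sum(log(atRisk))
}

# Day scale: everyone enters at time 0.
llDays <- nullLogLik(dd$startDay, dd$endDay, dd$death)

# Date scale: everyone enters at their own calendar start date.
llDates <- nullLogLik(as.numeric(dd$startDate), as.numeric(dd$endDate), dd$death)

c(days=llDays, dates=llDates)
```

If the two hand-computed values differ in the same way as the coxph() output, that would point to the risk sets (who counts as at risk at each event time) rather than to the numeric conversion itself.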
Finally, if you don't use tmerge(), and go straight to coxph() on the original dataset, then both methods give you the same results. Both of these models
ddMod2 <- coxph(formula=
Surv(time=endDay, event=death) ~ 1,
data=dd)
ddMod2
ddModDate2 <- coxph(formula=
Surv(time=as.numeric(endDate - startDate), event=death) ~ 1,
data=dd)
ddModDate2
give the same results as ddMod above, the version using days.