tmerge() + coxph(): two ways of setting up dates should give same results, and don't

Question

Basically, using tmerge() to create data for time-varying-covariate Cox regression, two ways of expressing times should give the same regression results (I think), but they don't.

One way uses start and end dates, and converts to numeric within Surv(); the other just uses numeric days to event.

Example

First, create some data. We have an ID, an outcome (death), a start date for each row, and an end date some time later. The start date and end date are Date objects.

library(survival)  # provides tmerge(), Surv(), and coxph()
n <- 1000
set.seed(0)
dd <- data.frame(id=1:n, 
  death=sample(x=c(FALSE, TRUE), prob=c(8, 1), size=n, replace=TRUE), 
  startDate=as.Date(runif(min=as.numeric(as.Date("2000-01-01")), 
    max=as.numeric(as.Date("2019-12-31")), n=n), origin="1970-01-01"))
dd$endDate <- as.Date(as.numeric(dd$startDate) + 
    rnorm(mean=3650, sd=500, n=n), origin="1970-01-01")
# (You can check that endDate is never before startDate.)

Rather than using a start and end date for each participant, we could start each person's time at zero and record the number of days until the event or censoring:

dd$startDay <- 0
dd$endDay <- as.numeric(dd$endDate - dd$startDate)

Next, we use tmerge() to transform the data into the format that would be needed for Cox regression with time-varying covariates. (Note: this is a minimal example that does not actually have any time-varying covariates.)

We do this two ways, to compare. 1) Using numeric days to event/censor; 2) Using dates.

Using days

ddTv <- tmerge(data1=dd, data2=dd, id=id, 
  tstart=startDay, tstop=endDay, event=event(endDay, death))
ddTv[1:6, ]

id death  startDate    endDate startDay endDay tstart  tstop event
 1  TRUE 2005-04-23 2014-11-29        0 3506.6      0 3506.6  TRUE
 2 FALSE 2010-08-13 2023-02-16        0 4570.6      0 4570.6 FALSE
 3 FALSE 2013-09-11 2023-06-22        0 3571.6      0 3571.6 FALSE
 4 FALSE 2007-08-31 2015-10-03        0 2955.1      0 2955.1 FALSE
 5  TRUE 2019-02-05 2027-01-27        0 2913.4      0 2913.4  TRUE
 6 FALSE 2002-05-14 2012-04-06        0 3615.2      0 3615.2 FALSE

Using dates

ddTvDate <- tmerge(data1=dd, data2=dd, id=id, 
  tstart=startDate, tstop=endDate, event=event(endDate, death))
ddTvDate[1:6, ]

id death  startDate    endDate startDay endDay     tstart      tstop event
 1  TRUE 2005-04-23 2014-11-29        0 3506.6 2005-04-23 2014-11-29  TRUE
 2 FALSE 2010-08-13 2023-02-16        0 4570.6 2010-08-13 2023-02-16 FALSE
 3 FALSE 2013-09-11 2023-06-22        0 3571.6 2013-09-11 2023-06-22 FALSE
 4 FALSE 2007-08-31 2015-10-03        0 2955.1 2007-08-31 2015-10-03 FALSE
 5  TRUE 2019-02-05 2027-01-27        0 2913.4 2019-02-05 2027-01-27  TRUE
 6 FALSE 2002-05-14 2012-04-06        0 3615.2 2002-05-14 2012-04-06 FALSE

Finally, these two ways of expressing the same data don't give the same regression results. We'll compare just the null model:

Using days

ddMod <- coxph(formula=
    Surv(time=tstart, time2=tstop, event=death) ~ 1, 
  data=ddTv)
ddMod

Null model
  log likelihood= -702.08 
  n= 1000

Using dates

ddModDate <- coxph(formula=
    Surv(time=as.numeric(tstart), time2=as.numeric(tstop), event=death) ~ 1, 
  data=ddTvDate)
ddModDate

Null model
  log likelihood= -681.85 
  n= 1000

Log-likelihoods are similar, but not the same.

Why are these not the same?

If you add covariates to the model then coefficients and p values between the two versions are again not the same.

Finally, if you don't use tmerge(), and go straight to coxph() on the original dataset, then both methods give you the same results. Both of these models

ddMod2 <- coxph(formula=
    Surv(time=endDay, event=death) ~ 1, 
  data=dd)
ddMod2

ddModDate2 <- coxph(formula=
    Surv(time=as.numeric(endDate - startDate), event=death) ~ 1, 
  data=dd)
ddModDate2

give the same results as ddMod above, the version using days.

"(Note: this is a minimal example that does not actually have any time-varying covariates.)" can you make one with a time varying covariate? — Mike, Jun 12 '23 at 18:54
@Mike I could, but it would just make the example harder to follow and we'd see the same effect. Can you explain how it would help? — eac2222, Jun 13 '23 at 12:49
Fair point. I guess I was trying to see whether the discrepant results would still show up in the use case with an actual time-varying covariate. Follow-up: do you think this is more of a programming question or a stats question? If it is a stats question, you might get better help on stat exchange. I found an interesting reference comparing the formulas of the generalized cox model to a time dependent model: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6015946/ — Mike, Jun 13 '23 at 13:14
@Mike the answer is yes, this example is pared down from real models with real time-varying and non-varying covariates, where I first noticed the discrepancy. And I'm pretty sure it isn't stats, theoretically the two models are the same. — eac2222, Jun 13 '23 at 14:09
Ok I see. I believe it is because in the date model your start time (`tstart`) is not 0; it is the number of days from 1970-01-01 to the start date. Your ddModDate2 model instead treats the start as time 0 and uses the difference between the dates as the end time. — Mike, Jun 13 '23 at 16:24
I don't think you should use the actual dates as start and stop times, instead I would use the days — Mike, Jun 13 '23 at 16:25

score 0 · Accepted Answer · answered Jun 22 '23 at 15:19

Professor Terry M. Therneau (creator of the survival package) kindly gave me an answer, with permission to post here.

Paraphrasing ---

Basically, the results are different because those are two entirely different models. Consider, for example, a participant who had an event on the 100th day that they were in the study, on January 1, 2010.

If I use calendar dates for my times, then the risk set for that event is everyone who was in the study on January 1, 2010.
If I use time since entry for my times, then the risk set for that event is everyone who was still in the study on their 100th day since entry.

Those are probably very different sets of people!
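To make the two risk-set definitions concrete, here is a minimal base-R sketch. The three participants, their dates, and the names `toy`, `riskCalendar` and `riskEntry` are all made up for illustration; participant 1 mirrors the example above (an event on January 1, 2010, their 100th day in the study).

```r
# Hypothetical toy data: three participants with entry and exit dates
toy <- data.frame(
  id        = 1:3,
  startDate = as.Date(c("2009-09-23", "2005-01-01", "2009-12-25")),
  endDate   = as.Date(c("2010-01-01", "2006-01-01", "2010-02-01")),
  death     = c(TRUE, FALSE, FALSE)
)
toy$endDay <- as.numeric(toy$endDate - toy$startDate)  # days since entry

# Participant 1's event, expressed on both time scales
evDate <- toy$endDate[1]  # calendar date of the event
evDay  <- toy$endDay[1]   # days since that participant's entry

# Calendar scale: at risk if already entered and not yet exited on evDate
riskCalendar <- toy$id[toy$startDate < evDate & toy$endDate >= evDate]

# Time-since-entry scale: at risk if followed for at least evDay days
riskEntry <- toy$id[toy$endDay >= evDay]

riskCalendar  # participants 1 and 3
riskEntry     # participants 1 and 2
```

Participant 2 left the study years before the event date but was followed for more than 100 days, while participant 3 was enrolled on the event date but followed for fewer than 100 days, so each appears in only one of the two risk sets.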

For almost every study, time since entry is the measure you want.

Obvious once he points it out, opaque to me until then.
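As a rough check of this explanation (my own sketch, not part of Professor Therneau's reply), the discrepancy can be reproduced without tmerge() at all, and it should disappear once the calendar dates are shifted so that each person's clock starts at their own entry. The simulated data and the names `toy`, `modEntry`, `modCalendar` and `modShifted` are assumptions made for illustration.

```r
library(survival)

# Hypothetical simulated data: random entry dates and follow-up lengths
set.seed(1)
n <- 200
toy <- data.frame(
  death     = sample(c(FALSE, TRUE), size = n, replace = TRUE),
  startDate = as.Date("2000-01-01") + sample(0:7000, size = n, replace = TRUE)
)
toy$endDate <- toy$startDate + sample(1000:5000, size = n, replace = TRUE)

# Time since entry: everyone starts at day 0
modEntry <- coxph(Surv(as.numeric(endDate - startDate), death) ~ 1, data = toy)

# Calendar time: (start, stop] intervals measured in days since 1970-01-01
modCalendar <- coxph(
  Surv(as.numeric(startDate), as.numeric(endDate), death) ~ 1, data = toy)

# Calendar dates shifted by each person's own entry date, which puts the
# (start, stop] intervals back on the time-since-entry scale
modShifted <- coxph(
  Surv(as.numeric(startDate) - as.numeric(startDate),
       as.numeric(endDate) - as.numeric(startDate), death) ~ 1, data = toy)

# Compare the null-model log-likelihoods across the three fits
sameAfterShift <- isTRUE(all.equal(modEntry$loglik, modShifted$loglik))
sameOnCalendar <- isTRUE(all.equal(modEntry$loglik, modCalendar$loglik))
c(sameAfterShift = sameAfterShift, sameOnCalendar = sameOnCalendar)
```

The same shift can be applied to the tmerge() output from the question by subtracting as.numeric(startDate) from as.numeric(tstart) and as.numeric(tstop) inside Surv().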
