I am trying to fit a survival model with left-truncated data using the survival
package however I am unsure of the correct syntax.
Let's say we are measuring the effect of age at when hired (age
) and job type (parttime
) on duration of employment of doctors in public health clinics. Whether the doctor quit or was censored is indicated by the censor
variable (0 for quittting, 1 for censoring). This behaviour was measured in an 18-month window. Time to either quit or censoring is indicated by two variables, entry
(start time) and exit
(stop time) indicating how long, in years, the doctor was employed at the clinic. If doctors commenced employment after the window 'opened' their entry
time is set to 0. If they commenced employment prior to the window 'opening' their entry
time represents how long they had already been employed in that position when the window 'opened', and their exit
time is how long from when they were initially hired they either quit or were censored by the window 'closing'. We also postulate a two-way interaction between age
and duration of employment (exit
).
This is the toy data set. It is much smaller than a normal dataset would be, so the estimates themselves are not as important as whether the syntax and variables included (using the survival
package in R) are correct, given the structure of the data. The toy data has the exact same structure as a dataset discussed in Chapter 15 of Singer and Willet's Applied Longitudinal Data Analysis. I have tried to match the results they report, without success. There is not a lot of explicit information online how to conduct survival analyses on left-truncated data in R, and the website that provides code for the book (here) does not provide R code for the chapter in question. The methods for modeling time-varying covariates and interaction effects are quite complex in R and I just wonder if I am missing something important.
Here is the toy data
id <- 1:40
entry <- c(2.3,2.5,2.5,1.2,3.5,3.1,2.5,2.5,1.5,2.5,1.4,1.6,3.5,1.5,2.5,2.5,3.5,2.5,2.5,0.5,rep(0,20))
exit <- c(5.0,5.2,5.2,3.9,4.0,3.6,4.0,3.0,4.2,4.0,2.9,4.3,6.2,4.2,3.0,3.9,4.1,4.0,3.0,2.0,0.2,1.2,0.6,1.9,1.7,1.1,0.2,2.2,0.8,1.9,1.2,2.3,2.2,0.2,1.7,1.0,0.6,0.2,1.1,1.3)
censor <- c(1,1,1,1,0,0,0,0,1,0,0,1,1,1,0,0,0,0,0,0,rep(1,20))
parttime <- c(1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0)
age <- c(34,28,29,38,33,33,32,28,40,30,29,34,31,33,28,29,29,31,29,29,30,37,33,38,34,37,37,40,29,38 ,49,32,30,27,35,34,35,30,35,34)
doctors <- data.frame(id,entry,exit,censor,parttime,age)
Now for the model.
coxph(Surv(entry, exit, 1-censor) ~ parttime + age + age:exit, data = doctors)
Is this the correct way to specify the model given the structure of the data and what we want to know? An answer here suggests it is correct, but I am not sure whether, for example, the interaction variable is correctly specified.