I am working through Chapter 15 of Applied Longitudinal Data-Analysis by Singer and Willett, on Extending the Cox Regression model, but the UCLA website here has no example R code for this chapter. I am trying to re-create the section on time-varying covariates and am stuck on how to create a count process dataset from the person-level dataframe provided. I have looked through the survival
package vignette but am having trouble creating my own dataset. Here is toy data modelled on the Singer and Willett example of predictors of time to first cocaine use in men between 17 and 40 years of age. id
is ID, ageInit
is age of either first cocaine use or censoring, used
is the event log, indicating first cocaine use (1
) or censoring (0
). There is one time invariant predictor earlyMJ
, indicating marijuana use prior to 17 years of age. There are also three time-varying covariates that need to inform the creation of the dataset: timeMJ
, which is age of first marijuana use after 17, sellMJ
which is age of selling marijuana, and odFirst
, which is age of first using some other drug. NA
s in these predictors indicate the participant did not perform the action in question at any time.
set.seed(1356)
df <- data.frame(id = 1:6,
ageInit = c(25,34,40,29,27,40),
used = c(0,1,1,0,1,1),
earlyMJ = c(0,0,1,0,1,1),
timeMJ = c(18,27,22,21,22,19),
sellMJ = c(NA,NA,25,NA,35,NA),
odFirst = c(19,22,35,NA,22,34))
Following Therneau et al.'s process from the survival
vignette we create a second dataset, in this case re-expressing the outcome variable ageInit
, and the time-varying predictors as number of years from study initiation (i.e. 17 years old).
tdata <- with(df, data.frame (id = id,
usedTime = ageInit - 17,
timeToFirstMJ = timeMJ - 17,
timeToSellMJ = sellMJ - 17,
timeToFirstOD = odFirst - 17,
usedCocaine = used))
The we merge this new data with the original dataset, creating, via the event()
call, two new columns expressing time intervals for each of the covariates. We also create, via the tdc()
call, a new row for each participant for each time they experienced one of the covariate events.
sdata <- tmerge(df, tdata,
id=id,
firstUse = event(futime, usedCocaine),
t1MJ = tdc(timeToFirstMJ),
t1SMJ = tdc(timeToSellMJ),
t1OD = tdc(timeToFirstOD),
options= list(idname="subject"))
attr(sdata, "tcount")
The issue is that when I run the Cox regression, using both the time-invariant, and a few of the time-varying covariates, I cannot make the model work.
coxph(Surv(tstart, tstop, firstUse) ~ earlyMJ + t1SMJ + t1OD, data= sdata, ties="breslow")
and get nonsensical coefficient and the warning message
In fitter(X, Y, strats, offset, init, control, weights = weights, :
Loglik converged before variable 1,3 ; beta may be infinite.
Furthermore I do not even know if this is a proper count process dataframe, because in Singer and Willett they suggest that each participant needs to have a row for every time anyone in the dataset experiences the event.
Any guidance in these matters would be much appreciated.