I'm working with survival data, and using survSplit()
to deal with non-proportionality with time-dependent coefficients. The model is based on the article by Terry Therneau et. al. (2020) (https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf)
I have a factor variable with 6 levels to represent different types of knee prostheses. When I'm applying survSplit()
with any cutpoints, the coefficients for the reference level of this time-adjusted factor appears as NA in the results. There is no collinearity, and the problem can be reproduced with other factor variables in the data. Also, if I change the reference value by altering the factor levels, the reference value is NA anyways.
The problem is reproduced below with the factor variable celltype
in the veteran
data:
str(veteran)
'data.frame': 137 obs. of 8 variables:
$ trt : num 1 1 1 1 1 1 1 1 1 1 ...
$ celltype: Factor w/ 4 levels "squamous","smallcell",..: 1 1 1 1 1 1 1 1 1 1 ...
$ time : num 72 411 228 126 118 10 82 110 314 100 ...
$ status : num 1 1 1 1 1 1 1 1 1 0 ...
$ karno : num 60 70 60 60 70 20 40 80 50 70 ...
$ diagtime: num 7 5 3 9 11 5 10 29 18 6 ...
$ age : num 69 64 38 63 65 49 69 68 43 70 ...
$ prior : num 0 10 0 10 10 0 10 0 0 0 ...
library(tidyverse)
library(survival)
library(survminer)
df <- veteran
cox <- coxph(Surv(time, status) ~ celltype + age + prior, data = df)
cox.zph(cox, terms=F)
cox_tdc <- survSplit(Surv(time, status) ~ .,
data= df,
cut=c(150),
zero=0,
episode= "tgroup",
id="id") %>%
dplyr::select(id, tstart, time, status, tgroup, celltype, age, prior)
coxph(Surv(tstart, time, status) ~
celltype:strata(tgroup) + age + prior,
data=cox_tdc)
Call:
coxph(formula = Surv(tstart, time, status) ~ celltype:strata(tgroup) +
age + prior, data = cox_tdc)
coef exp(coef) se(coef) z p
age 0.005686 1.005702 0.009494 0.599 0.549262
prior 0.008592 1.008629 0.020661 0.416 0.677516
celltypesquamous:strata(tgroup)tgroup=1 0.300732 1.350848 0.360243 0.835 0.403828
celltypesmallcell:strata(tgroup)tgroup=1 1.172992 3.231649 0.325177 3.607 0.000309
celltypeadeno:strata(tgroup)tgroup=1 1.232753 3.430660 0.352423 3.498 0.000469
celltypelarge:strata(tgroup)tgroup=1 NA NA 0.000000 NA NA
celltypesquamous:strata(tgroup)tgroup=2 -1.160625 0.313290 0.450989 -2.574 0.010067
celltypesmallcell:strata(tgroup)tgroup=2 -0.238994 0.787420 0.542002 -0.441 0.659252
celltypeadeno:strata(tgroup)tgroup=2 1.455195 4.285319 0.837621 1.737 0.082335
celltypelarge:strata(tgroup)tgroup=2 NA NA 0.000000 NA NA
Likelihood ratio test=34.54 on 8 df, p=3.238e-05
n= 171, number of events= 128
Likelihood ratio test=69.43 on 9 df, p=1.97e-11
n= 272, number of events= 128
The problem with this is that I cannot test the Shoenfeld residuals as cox.zph()
results an error: "Error in solve.default(imat, u) :
system is computationally singular: reciprocal condition number = 5.09342e-19". Because of the NAs.
Problem with NAs does not happen if I don't use time-dependent coefficients (:strata(tgroup)
)
Has anyone dealed with this problem previously? Why some of the coefficients are NAs? I really appreciate your help with this!
EDIT: example was changed to include reproducible data.
EDIT2: Fixed the time cutpoint in the example which resulted biased coefficients
EDIT3: I asked Terry Therneau about this problem and the email conversation is below:
Terry Therneau: There are two issues here.
The model.matrix routine is used by lm, glm, coxph and a host of other routines to create the X matrix for regression. It tries to be intelligent so as to not create redundant columns in X; those columns will end up with an NA coefficient. It is pretty good, but not perfect. Your case of a model with strata(tgroup): factor variable is one where it leaves in too many. The extra NA in the printout are a nuiscance, but not something that I can fix.
The cox.zph routine, on the other hand, is my problem. It should ignore those NA columns, and does not. There is actually code to check for the NA, but your example shows that it is incomplete. I will add an NA case, like yours, so my test suite and repair the problem. (The NA case worked once, but some update broke it.)
Me: When the I get the results with NA rows as in this case, are the coefficients still correct, and the NA represents the reference value?
Terry: Yes, all the coefficients are correct. SAS, for instance, does not try to 'pre-eliminate' columns and models with factors always have some missing in the coefficient vector. Since they use "." instead of "NA" for printing the missings don't jump off the page as much. Numerically, there is no penalty for doing in one way or the other.