Why coxph() results some of the coefficient as NA when using survSplit() in R?

Question

I'm working with survival data, and using survSplit() to deal with non-proportionality with time-dependent coefficients. The model is based on the article by Terry Therneau et. al. (2020) (https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf)

I have a factor variable with 6 levels to represent different types of knee prostheses. When I'm applying survSplit() with any cutpoints, the coefficients for the reference level of this time-adjusted factor appears as NA in the results. There is no collinearity, and the problem can be reproduced with other factor variables in the data. Also, if I change the reference value by altering the factor levels, the reference value is NA anyways.

The problem is reproduced below with the factor variable celltype in the veteran data:

 str(veteran)
'data.frame':   137 obs. of  8 variables:
 $ trt     : num  1 1 1 1 1 1 1 1 1 1 ...
 $ celltype: Factor w/ 4 levels "squamous","smallcell",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ time    : num  72 411 228 126 118 10 82 110 314 100 ...
 $ status  : num  1 1 1 1 1 1 1 1 1 0 ...
 $ karno   : num  60 70 60 60 70 20 40 80 50 70 ...
 $ diagtime: num  7 5 3 9 11 5 10 29 18 6 ...
 $ age     : num  69 64 38 63 65 49 69 68 43 70 ...
 $ prior   : num  0 10 0 10 10 0 10 0 0 0 ...

library(tidyverse)
library(survival)
library(survminer)

df <- veteran

cox <- coxph(Surv(time, status) ~ celltype + age + prior, data = df)

cox.zph(cox, terms=F)

cox_tdc <- survSplit(Surv(time, status) ~ ., 
                     data= df, 
                     cut=c(150),
                     zero=0,
                     episode= "tgroup", 
                     id="id") %>% 
  dplyr::select(id, tstart, time, status, tgroup, celltype, age, prior)

coxph(Surv(tstart, time, status) ~ 
                       celltype:strata(tgroup) + age + prior, 
                     data=cox_tdc)

Call:
coxph(formula = Surv(tstart, time, status) ~ celltype:strata(tgroup) + 
    age + prior, data = cox_tdc)

                                              coef exp(coef)  se(coef)      z        p
age                                       0.005686  1.005702  0.009494  0.599 0.549262
prior                                     0.008592  1.008629  0.020661  0.416 0.677516
celltypesquamous:strata(tgroup)tgroup=1   0.300732  1.350848  0.360243  0.835 0.403828
celltypesmallcell:strata(tgroup)tgroup=1  1.172992  3.231649  0.325177  3.607 0.000309
celltypeadeno:strata(tgroup)tgroup=1      1.232753  3.430660  0.352423  3.498 0.000469
celltypelarge:strata(tgroup)tgroup=1            NA        NA  0.000000     NA       NA
celltypesquamous:strata(tgroup)tgroup=2  -1.160625  0.313290  0.450989 -2.574 0.010067
celltypesmallcell:strata(tgroup)tgroup=2 -0.238994  0.787420  0.542002 -0.441 0.659252
celltypeadeno:strata(tgroup)tgroup=2      1.455195  4.285319  0.837621  1.737 0.082335
celltypelarge:strata(tgroup)tgroup=2            NA        NA  0.000000     NA       NA

Likelihood ratio test=34.54  on 8 df, p=3.238e-05
n= 171, number of events= 128 

Likelihood ratio test=69.43  on 9 df, p=1.97e-11
n= 272, number of events= 128

The problem with this is that I cannot test the Shoenfeld residuals as cox.zph() results an error: "Error in solve.default(imat, u) : system is computationally singular: reciprocal condition number = 5.09342e-19". Because of the NAs.

Problem with NAs does not happen if I don't use time-dependent coefficients (:strata(tgroup))

Has anyone dealed with this problem previously? Why some of the coefficients are NAs? I really appreciate your help with this!

EDIT: example was changed to include reproducible data.

EDIT2: Fixed the time cutpoint in the example which resulted biased coefficients

EDIT3: I asked Terry Therneau about this problem and the email conversation is below:

Terry Therneau: There are two issues here.

The model.matrix routine is used by lm, glm, coxph and a host of other routines to create the X matrix for regression. It tries to be intelligent so as to not create redundant columns in X; those columns will end up with an NA coefficient. It is pretty good, but not perfect. Your case of a model with strata(tgroup): factor variable is one where it leaves in too many. The extra NA in the printout are a nuiscance, but not something that I can fix.
The cox.zph routine, on the other hand, is my problem. It should ignore those NA columns, and does not. There is actually code to check for the NA, but your example shows that it is incomplete. I will add an NA case, like yours, so my test suite and repair the problem. (The NA case worked once, but some update broke it.)

Me: When the I get the results with NA rows as in this case, are the coefficients still correct, and the NA represents the reference value?

Terry: Yes, all the coefficients are correct. SAS, for instance, does not try to 'pre-eliminate' columns and models with factors always have some missing in the coefficient vector. Since they use "." instead of "NA" for printing the missings don't jump off the page as much. Numerically, there is no penalty for doing in one way or the other.

Hi, your example is not [reproducible](https://stackoverflow.com/help/minimal-reproducible-example) because we don't have access to the data you use. You should consider using toy data (```mtcars```, ```iris```...) or ```dput``` to make your data reproducible. — bretauv, Mar 08 '20 at 13:40
Thank you for this comment @bretauv. I changed to example to be reproducible with the `veteran` data — Vilponk, Mar 08 '20 at 17:29
thanks but you should also include the packages you use because this dataset is not available for anyone but is included in a package (I suppose). Besides, you use functions that are not in base R but in some packages — bretauv, Mar 08 '20 at 17:31
If you look at teh results you see that YOUR COEFFICIENTS ARE SERIOUSLY WEIRD: `celltypesquamous:strata(tgroup)tgroup=1 1.609e+01 9.714e+06 1.952e+03` So you've created a pathological data situation. — IRTFM, Mar 09 '20 at 06:22
Thanks @42- ! The coefficients were weird because of the time cutpoints. Now they are fixed, yet the NAs are still there. — Vilponk, Mar 09 '20 at 15:58
Have you looked at any tabular analyses to see if 'celltypelarge' has any events in the earlier time strata? (After doing a simple tabular look: `with(cox_tdc, table(tgroup, celltype, status))`, I was surprised you were not getting more NA's.) Survival analysis is all about the events, not the numbers at risk. — IRTFM, Mar 10 '20 at 00:10
Good point @42-, It seems that lack of events was not the case as `celltypelarge` had events in both time groups. I checked the table in my real data too and there was no zeros. — Vilponk, Mar 10 '20 at 04:19
I can tell you that reversing the order of the factors will change the level-by-stratum terms that show NA's, so it is not the case that it is something peculiar to that factor level's event counts. Furthermore running the example in the cited paper shows that Therneau was illustrating using time-splits interacting with a continuous variable. So maybe these two NA terms are the usual anti-aliasing behavior of R regression functions with factor variables. You don't notice it when the strata() term is omitted because that level's term is silently omitted to prevent singularity. — IRTFM, Mar 10 '20 at 16:52
I think this might constitute a MCVE that you could submit to Terry Therneau whose current email address is easily obtained with `maintainer("survival")`. He might be able to offer a workaround and he might also fix the code of `cox.zph` to more smoothly deal with time-splits interacting with factor variables. — IRTFM, Mar 10 '20 at 17:09
Thank you @42- for the hint. Terry Therneau has now fixed the `cox.zph` function and the error is now resolved! — Vilponk, Mar 20 '20 at 15:10

Why coxph() results some of the coefficient as NA when using survSplit() in R?

0 Answers0