
I'm performing cross-validation on a competing risks proportional hazards model. With help from the mstate package, I've prepared my data and am fitting it with survival::coxph. I get a fitted Cox model object for my training data, but I want to evaluate the partial likelihood of my trained coefficients on my test data.

If I need to, I'll write the partial likelihood function myself, but I'd rather not (though it would probably be good for me). The survival package does this calculation in C code, but the likelihood calculation is embedded in the fitting function. Maybe there's a way to fix the parameters, or some other tool to easily get at the partial likelihood?
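
To be concrete, the quantity I want is the log partial likelihood at a fixed coefficient vector beta, evaluated on the test data and summed over strata (writing it with the Breslow treatment of ties, which is close enough for my purposes):

loglik(beta) = sum over strata s { sum over events i in s [ x_i' beta - log( sum over j in the risk set R_s(t_i) of exp(x_j' beta) ) ] }

where beta comes from the training fit and the covariates, times, and statuses come from the test data.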

Minimum Working Example

# Adapted from examples in the mstate vignette
# http://cran.r-project.org/web/packages/mstate/vignettes/Tutorial.pdf
# beginning at the bottom of page 28

library(mstate)
library(survival)

# Get data. I add a second explanatory variable (badx) for illustration
# Also divide the data by subject into training and test sets.
data(aidssi)
si <- aidssi # Just a shorter name
si$badx <- sample(c("A", "B"), size = nrow(si), replace = TRUE)
si$fold <- sample(c("train", "test"), size = nrow(si), replace = TRUE, prob = c(0.7, 0.3))
tmat <- trans.comprisk(2, names = c("event-free", "AIDS", "SI"))
si$stat1 <- as.numeric(si$status == 1)
si$stat2 <- as.numeric(si$status == 2)

# Convert the data to a long competing risks format
silong <- msprep(time = c(NA, "time", "time"), 
                 status = c(NA,"stat1", "stat2"),
                 data = si, keep = c("ccr5", "badx", "fold"), trans = tmat)
silong <- na.omit(silong)
silong <- expand.covs(silong, c("ccr5", "badx"))
train.dat <- subset(silong, fold == "train")
test.dat <- subset(silong, fold == "test")

Data looks like this:

> head(silong)
An object of class 'msdata'

Data:
  id from to trans Tstart  Tstop   time status ccr5 badx  fold ccr5WM.1 ccr5WM.2 badxB.1 badxB.2
1  1    1  2     1      0  9.106  9.106      1   WW    A train        0        0       0       0
2  1    1  3     2      0  9.106  9.106      0   WW    A train        0        0       0       0
3  2    1  2     1      0 11.039 11.039      0   WM    B train        1        0       1       0
4  2    1  3     2      0 11.039 11.039      0   WM    B train        0        1       0       1
5  3    1  2     1      0  2.234  2.234      1   WW    B train        0        0       1       0
6  3    1  3     2      0  2.234  2.234      0   WW    B train        0        0       0       1

Now, the ccr5 variable could be modeled as transition-specific, or as having an equal proportional effect across all transitions. The models are:

train.mod.equal <- coxph(Surv(time, status) ~ ccr5 + badx + strata(trans),
                         data = train.dat)
train.mod.specific <- coxph(Surv(time, status) ~ ccr5WM.1 + ccr5WM.2 + badx + strata(trans),
                            data = train.dat)

Now I would like to use the test data to evaluate the variable-selection question of whether ccr5 should be transition-specific or not. I have a large data set and many variables--mostly but not all categorical--that could go either way. The evaluation is where I'm stuck.

# We can fit the same models to the test data,
# this yields new parameter estimates of course,
# but the model matrices might be useful
test.mod.equal <- coxph(Surv(time, status) ~ ccr5 + badx + strata(trans),
                         data = test.dat)
test.mod.specific <- coxph(Surv(time, status) ~ ccr5WM.1 + ccr5WM.2 + badx + strata(trans),
                            data = test.dat)
test.eq.mm <- model.matrix(test.mod.equal)
test.sp.mm <- model.matrix(test.mod.specific)

# We can use these to get the first part of the sum of the partial likelihood:
xbeta.eq <- test.eq.mm[test.dat$status == 1, ] %*% coef(train.mod.equal)
xbeta.sp <- test.sp.mm[test.dat$status == 1, ] %*% coef(train.mod.specific)

# We can also get linear predictors
lp.eq <- predict(train.mod.equal, newdata = test.dat, type = "lp")
lp.sp <- predict(train.mod.specific, newdata = test.dat, type = "lp")
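
If it comes to writing the partial likelihood myself, I imagine it would look roughly like the sketch below. (`pll` is just a hypothetical helper name; it assumes no delayed entry, which holds here since Tstart is 0 for every row, and it uses the Breslow treatment of ties, whereas coxph defaults to Efron, so it won't exactly reproduce coxph's $loglik when event times are tied.)

pll <- function(lp, time, status, strata) {
  total <- 0
  for (d in split(data.frame(lp, time, status), strata)) {
    for (i in which(d$status == 1)) {
      # each event contributes its linear predictor minus the log of the sum
      # of exp(linear predictor) over everyone still at risk in the same stratum
      total <- total + d$lp[i] - log(sum(exp(d$lp[d$time >= d$time[i]])))
    }
  }
  total
}

# log partial likelihood of the training coefficients on the test data
pll(lp.eq, test.dat$time, test.dat$status, test.dat$trans)
pll(lp.sp, test.dat$time, test.dat$status, test.dat$trans)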

I'm hoping to calculate the partial likelihood for each of the models on the test data with the training coefficient estimates. Maybe I should move the question to Cross Validated and ask if the sum of the linear predictors (or the sum of the linear predictors excluding censored cases) is close enough to an equivalent measure.
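
One other thought, which I haven't tested: ?coxph has an init argument and coxph.control has iter.max, so maybe I can fix the coefficients at the training estimates, suppress the iteration, and read the log partial likelihood at those initial values off the first element of $loglik. Something like:

eq.fixed <- coxph(Surv(time, status) ~ ccr5 + badx + strata(trans),
                  data = test.dat, init = coef(train.mod.equal),
                  control = coxph.control(iter.max = 0))
eq.fixed$loglik[1]  # log partial likelihood at the initial (i.e. training) coefficients

But I'm not certain iter.max = 0 behaves the way I hope, which is part of why I'm asking.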

Gregor Thomas
  • The problem is that partial likelihoods vary along the course of the time variable as the risk set shrinks. If you explain what you are actually attempting, preferably with a small example set, specifics might become available. – IRTFM Nov 03 '14 at 20:51
  • @BondedDust I'm working up an example. I realize that the contribution to the likelihood from each observation depends on which other observations are still in the risk set, but the summed partial log likelihood is still just a function of model parameters given data. I have both model parameters (from the training fit) and data (from the test set), I'm just lacking the function to compute the partial log likelihood. – Gregor Thomas Nov 03 '14 at 21:04
  • So you want to compute how far predict(mdl, type="lp") lies (on a log-likelihood scale) from a set of data with censoring and outcome variables. Can you calculate a "neo-model" with a formula that includes an offset that uses the beta estimates, and then use summary(mdl) to do the heavy lifting for you? You might even be able to calculate the offset with `predict.coxph`. – IRTFM Nov 03 '14 at 21:25
  • Hmmm, I'll have to think on that. I like the neo-model idea. I'm getting pulled off into a meeting, but I'll be back this evening with an MWE (which, if I'm really lucky, I'll post as an answer). – Gregor Thomas Nov 03 '14 at 21:47
  • MWE added, no luck with the "neo-model" approach, at least not by myself. – Gregor Thomas Nov 05 '14 at 01:25
  • My strategy seems to give sensible results, but perhaps you should review. Do you have a "correct answer" to compare? – IRTFM Nov 05 '14 at 07:25
  • I'll definitely be reviewing--you've about 80% convinced me thus far. My "correct answer" for the example above would be that the `ccr5` variable is superior as transition-specific, whereas the `badx` variable is not (ignoring the fact that `badx` is bad enough to be thrown out altogether). And I'm happy to penalize the transition-specific models for fitting an extra parameter. – Gregor Thomas Nov 05 '14 at 16:20
  • I'm going to start testing now to see if it would be reasonable to use `logLik()` to get the log likelihood of the neo-model, then add to that the LRT statistic. Then I might be able to directly compare the `test.eq` and `test.sp` models. Or maybe I just need to run null models with each of the `offset`s... – Gregor Thomas Nov 05 '14 at 16:24

1 Answer


This is what I was proposing when I wrote: 'Can you calculate a "neo-model" (using the [new data]) with a formula that includes an offset [built with the] beta estimates [from the original fit] and then use summary(mdl) to do the heavy lifting for you? You might even be able to calculate the offset with predict.coxph.' It turns out I don't need to use summary.coxph, since print.coxph gives the LRT statistic.

lp.eq <- predict(train.mod.equal, newdata = test.dat, type = "lp")
eq.test.mod <- coxph(Surv(time, status) ~ ccr5 + badx + strata(trans) + offset(lp.eq),
                     data = test.dat)
eq.test.mod

Call:
coxph(formula = Surv(time, status) ~ ccr5 + badx + strata(trans) + 
    offset(lp.eq), data = test.dat)


           coef exp(coef) se(coef)       z    p
ccr5WM -0.20841     0.812    0.323 -0.6459 0.52
badxB  -0.00829     0.992    0.235 -0.0354 0.97

Likelihood ratio test=0.44  on 2 df, p=0.804  n= 212, number of events= 74 

I would interpret this to mean that a similar model, fit with the predictions based on the first model but with new data, was not significantly different from a null model, and that on a log-likelihood scale it was 0.44 "away" from an exact fit.

As pointed out by @Gregor, one can access the 'loglik' node of the coxph object, but I would advise against attaching too much meaning to the individual values. To get the LRT statistic one could compute:

> diff(eq.test.mod$loglik)
[1] 0.399137
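
The same recipe should carry over to the transition-specific parameterization (I have not run this), which would give you the other half of the comparison on the same test data:

lp.sp <- predict(train.mod.specific, newdata = test.dat, type = "lp")
sp.test.mod <- coxph(Surv(time, status) ~ ccr5WM.1 + ccr5WM.2 + badx +
                       strata(trans) + offset(lp.sp),
                     data = test.dat)
diff(sp.test.mod$loglik)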

For interest's sake, also look at the result without the offset:

> coxph(Surv(time, status) ~ ccr5 + badx + strata(trans), 
+       data=test.dat)
Call:
coxph(formula = Surv(time, status) ~ ccr5 + badx + strata(trans), 
    data = test.dat)


          coef exp(coef) se(coef)      z      p
ccr5WM -0.8618     0.422    0.323 -2.671 0.0076
badxB  -0.0589     0.943    0.235 -0.251 0.8000

Likelihood ratio test=8.42  on 2 df, p=0.0148  n= 212, number of events= 74 

And you do get the expected result when testing against the original data:

> lp.eq2 <- predict(train.mod.equal, newdata = train.dat, type = "lp")
> coxph(Surv(time, status) ~ ccr5 + badx + strata(trans)+offset(lp.eq2), 
+       data=train.dat)
Call:
coxph(formula = Surv(time, status) ~ ccr5 + badx + strata(trans) + 
    offset(lp.eq2), data = train.dat)


            coef exp(coef) se(coef)         z p
ccr5WM -4.67e-12         1    0.230 -2.03e-11 1
badxB   2.57e-14         1    0.168  1.53e-13 1

Likelihood ratio test=0  on 2 df, p=1  n= 436, number of events= 146 
IRTFM
  • I think I can get even closer to what I want by examining (newly named in your answer) `eq.test.mod$loglik`. Looking at `?coxph.object`, the first element of `$loglik` is the log likelihood with the initial values. – Gregor Thomas Nov 05 '14 at 19:59
  • Thanks for your help with this, wish I had more upvotes to give! – Gregor Thomas Nov 05 '14 at 19:59
  • OK, but do remember that the loglikelihood is not a "solid" number. It is only valid "up to a constant" which then gets implicitly "subtracted out" when doing the LRT. It is only by comparisons with a nested model that you can legitimately extract information. I do not think it would necessarily be appropriate to compare the 'loglik' values across two different datasets. – IRTFM Nov 05 '14 at 20:04
  • Though the models aren't strictly nested, I'll be using the same dataset for each--not comparing between test and train, just different parameterizations of the test data. So far, the results all look very reasonable. – Gregor Thomas Nov 05 '14 at 22:17