1
# Create the simplest test data set 
test1 <- list(time=c(4,3,1,1,2,2,3), 
              status=c(1,1,1,0,1,1,0), 
              x=c(0,2,1,1,1,0,0), 
              sex=c(0,0,0,0,1,1,1)) 
# Fit a stratified model 
m=coxph(Surv(time, status) ~ x + sex, test1) 

y=predict(m,type="survival",by="sex")

Basically what I am doing is making fake data called test1, then I am fitting a simple coxph model and saving it as 'm'. Then what I aim to do is get the predicted probabilities and confidence bands for the survival probability separate for sexes. My hopeful dataset 'y' will include: age, survival probability, lower confidence band, upper confidence band, and sex which equals to '0' or '1'.

bvowe
  • 3,004
  • 3
  • 16
  • 33
  • Your question would make more sense if you explained your issue, not just you goal. Do you get an error? If so, what? Or perhaps the predictions don't seem right? Why? Or is there another problem? What is it? – Gregor Thomas Jun 17 '19 at 02:47
  • That said, in general if you want individual predictions from a survival model, I might recommend *not* using a Cox model. Cox PH is directly modeling only *relative* risk. In my experience, getting and testing predictions for individual cases is easier with other sorts of models. – Gregor Thomas Jun 17 '19 at 02:56
  • Confidence bands will apply to the `test1` dataset, not to any newdata object. Have you done any searching on confidence intervals for survival. Pretty sure I've seen earlier such questions. – IRTFM Jun 30 '19 at 21:39

2 Answers2

1

This can be accomplished in two ways. The first is a slight modification to your code, using the predict() function to get predictions at a specific times for specific combinations of covariates. The second is by using the survfit() function, which estimates the entire survival curve and is easy to plot. The confidence intervals don't exactly agree as we'll see, but they should match fairly closely as long as the probabilities aren't too close to 1 or 0.

Below is code to both make the predictions as your code tries. It uses the built-in cancer data. The important difference is to create a newdata which has the covariate values you're interested in. Because of the non-linear nature of survival probabilities it is generally a bad idea to try and make a prediction for the "average person". Because we want to get a survival probability we must also specify what time to consider that probability. I've taken time = 365, age = 60, and both sex = 1 and sex = 2 So this code predicts the 1-year survival probability for a 60 year old male and a 60 year old female. Note that we must also include status in the newdata, even though it doesn't affect the result.

library(survival)
mod <- coxph(Surv(time,status) ~ age + sex, data = cancer)
pred_dat <- data.frame(time = c(365,365), status = c(2,2),
                       age = c(60,60), sex = c(1,2))
preds <- predict(mod, newdata = pred_dat,
                 type = "survival", se.fit = TRUE)
pred_dat$prob <- preds$fit
pred_dat$lcl <- preds$fit - 1.96*preds$se.fit
pred_dat$ucl <- preds$fit + 1.96*preds$se.fit
pred_dat
#>   time status age sex      prob       lcl       ucl
#> 1  365      2  60   1 0.3552262 0.2703211 0.4401313
#> 2  365      2  60   2 0.5382048 0.4389833 0.6374264

We see that for a 60 year old male the 1 year survival probability is estimated as 35.5%, while for a 60 year old female it is 53.8%.

Below we estimate the entire survival curve using survfit(). I've saved time by reusing the pred_dat from above, and because the plot gets messy I've only plotted the male curve, which is the first row. I've also added some flair, but you only need the first 2 lines.

fit <- survfit(mod, newdata = pred_dat[1,])
plot(fit, conf.int = TRUE)
title("Estimated survival probability for age 60 male")
abline(v = 365, col = "blue")
abline(h = pred_dat[1,]$prob, col = "red")
abline(h = pred_dat[1,]$lcl, col = "orange")
abline(h = pred_dat[1,]$ucl, col = "orange")

Created on 2022-06-09 by the reprex package (v2.0.1)

I've overlaid lines corresponding to the predicted probabilities from part 1. The red line is the estimated survival probability at day 365 and the orange lines are the 95% confidence interval. The predicted survival probability matches, but if you squint closely you'll see the confidence interval doesn't match exactly. That's generally not a problem, but if it is a problem you should trust the ones from survfit() instead of the ones calculated from predict().

You can also dig into the values of fit to extract fitted probabilities and confidence bands, but the programming is a little more complicated because the desired time doesn't usually match exactly.

0

Section 5 of this document by Dimitris Rizopoulos discusses how to estimate Survival Probabilities from a Cox model. Dimitris Rizipoulos states:

the Cox model does not estimate the baseline hazard, and therefore we cannot directly obtain survival probabilities from it. To achieve that we need to combine it with a non-parametric estimator of the baseline hazard function. The most popular method to do that is to use the Breslow estimator. For a fitted Cox model from package survival these probabilities are calculated by function survfit(). As an illustration, we would like to derive survival probabilities from the following Cox model for the AIDS dataset:

He then goes on to provide R code that shows how to estimate Survival Probabilities at specific follow-up times.

I found this useful, it may help you too.

Fritz45
  • 752
  • 1
  • 10
  • 23