6

I am working with R. Using a tutorial, I was able to create a statistical model and produce visual plots for some of the outputs:

#load libraries
library(survival)

library(dplyr)

library(ranger)

library(data.table)

library(ggplot2)

#use the built in "lung" data set
#remove missing values (dataset is called "a")

a <- na.omit(lung)

#create id variable

a$ID <- seq_along(a[,1])

#create test set with only the first 3 rows

new <- a[1:3,]

#create a training set by removing first three rows

a <- a[-c(1:3),]



#fit survival model (random survival forest)

r_fit <- ranger(Surv(time,status) ~ age + sex + ph.ecog + ph.karno + pat.karno + meal.cal + wt.loss, data = a, mtry = 4, importance = "permutation", splitrule = "extratrees", verbose = TRUE)

#create new intermediate variables required for the survival curves

death_times <- r_fit$unique.death.times

surv_prob <- data.frame(r_fit$survival)

avg_prob <- sapply(surv_prob, mean)

#use survival model to produce estimated survival curves for the first three observations

pred <- predict(r_fit, new, type = 'response')$survival

pred <- data.table(pred)

colnames(pred) <- as.character(r_fit$unique.death.times)

#plot the results for these 3 patients

plot(r_fit$unique.death.times, pred[1,], type = "l", col = "red")

lines(r_fit$unique.death.times, pred[2,], type = "l", col = "green")

lines(r_fit$unique.death.times, pred[3,], type = "l", col = "blue")

enter image description here

Now, I am trying to convert the above plot into ggplot format (and add 95% confidence intervals):

ggplot(r_fit) + geom_line(aes(x = r_fit$unique.death.times, y = pred[1,], group = 1), color = red)  +  geom_ribbon(aes(ymin = 0.95 * pred[1,], ymax = - 0.95 * pred[1,]), fill = "red") + geom_line(aes(x = r_fit$unique.death.times, y = pred[2,], group = 1), color = blue) + geom_ribbon(aes(ymin = 0.95 * pred[2,], ymax = - 0.95 * pred[2,]), fill = "blue") + geom_line(aes(x = r_fit$unique.death.times, y = pred[3,], group = 1), color = green) + geom_ribbon(aes(ymin = 0.95 * pred[3,], ymax = - 0.95 * pred[3,]), fill = "green") + theme(axis.text.x = element_text(angle = 90)) + ggtitle("sample graph")

But this produces the following error:

Error: `data` must be a data frame, or other object coercible by `fortify()`, not an S3 object with class ranger
Run `rlang::last_error()` to see where the error occurred.

What is the reason for this error? Can someone please show me how to fix this problem?

Thanks

Sam Rogers
  • 787
  • 1
  • 8
  • 19
stats_noob
  • 5,401
  • 4
  • 27
  • 83
  • 1
    `r_fit` doesn't appear to be a data.frame and you're using `ggplot2` incorrectly. For example, you should refer to variables by their name, not passing the whole variable into the function. Like `aes(x = unique.death.times, ...)` over `aes(x = r_fit$unique.death.times, ...)`. – Roman Luštrik May 16 '21 at 16:52
  • 1
    `#plot the results for these 3 patients`. I see multiple events for each patient. Interesting... – Limey May 16 '21 at 17:39
  • 1
    @Limey: isn't that the point? – stats_noob May 16 '21 at 18:24

1 Answers1

7

As per the ggplot2 documentation, you need to provide a data.frame() or object that can be converted (coerced) to a data.frame(). In this case, if you want to reproduce the plot above in ggplot2, you will need to manually set up the data frame yourself.

Below is an example of how you could set up the data to display the plot in ggplot2.

Data Frame

First we create a data.frame() with the variables that we want to plot. The easiest way to do this is to just group them all in as separate columns. Note that I have used the as.numeric() function to first coerce the predicted values to a vector, because they were previously a data.table row, and if you don't convert them they are maintained as rows.

ggplot_data <- data.frame(unique.death.times = r_fit$unique.death.times,
                      pred1 = as.numeric(pred[1,]),
                      pred2 = as.numeric(pred[2,]),
                      pred3 = as.numeric(pred[3,]))
head(ggplot_data)
## unique.death.times     pred1     pred2     pred3
## 1                  5 0.9986676 1.0000000 0.9973369
## 2                 11 0.9984678 1.0000000 0.9824642
## 3                 12 0.9984678 0.9998182 0.9764154
## 4                 13 0.9984678 0.9998182 0.9627118
## 5                 15 0.9731656 0.9959416 0.9527424
## 6                 26 0.9731656 0.9959416 0.9093876

Pivot the data

This format is still not ideal, because in order to plot the data and colour by the correct column (variable), we need to 'pivot' the data. We need to load the tidyr package for this.

library(tidyr)
ggplot_data <- ggplot_data %>% 
  pivot_longer(cols = !unique.death.times, 
  names_to = "category", values_to = "predicted.value")

Plotting

Now the data is in a form that makes it really easy to plot in ggplot2.

plot <- ggplot(ggplot_data, aes(x = unique.death.times, y = predicted.value, colour = category)) +
      geom_line()
plot

ggplot

If you really want to match the look of the base plot, you can add theme_classic():

plot + theme_classic()

ggplot with theme_classic

Additional notes

Note that this doesn't include 95% confidence intervals, so they would have to be calculated separately. Be aware though, that a 95% confidence interval is not just 95% of the y value at a given x value. There are calculations that will give you the correct values of the confidence interval, including functions built into R.

For a quick view of a trend line with prediction intervals, you can use the geom_smooth() function in ggplot2, but in this case it adds a loess curve by default, and the intervals provided by that function.

plot + theme_classic() + geom_smooth()

ggplot with smooth trend

Sam Rogers
  • 787
  • 1
  • 8
  • 19
  • Thank you so much! Do you know if this can also be done using ggsurvminer? – stats_noob May 20 '21 at 03:07
  • 1
    You're most welcome, hope it helped :) If this answered your question, don't forget to mark accepted. The short answer is I don't know, sorry. The longer answer is that I don't think it will pass straight into ggsurvplot (assume this is what you mean?), because that function seems to expect objects from the `survival` package. I have essentially no experience with survival analysis, so can't really help. You can have a look at the documentation here: https://rdrr.io/cran/survival/man/survfit.html and see if you can work it out, or else post a new SO question. – Sam Rogers May 20 '21 at 03:22
  • can you please take a look at this question if you have time? https://stackoverflow.com/questions/68644940/r-error-optimization-resulting-in-unused-argumentx thank you – stats_noob Aug 05 '21 at 03:45