7

I would like to implement diagnostics for Cox proportional hazards model with ggplot2, by creating new stats functions and ggproto objects. The idea is to benefit from grouping (by color, facet_grid, etc.) for conditional computation of desired statistics (e.g. martingale residuals). In the example below, the null model is refitted and martingale's computed for each cofactor level. The issues are:

  1. Fit is done based on data argument provided internally for the ggproto's compute_group method, not the actual dataset. This means, that dataframe is stripped of columns not explicitly defined in aes so it's up to the user to provide at least "time to event" and "event indicator" columns each time. (unless they're hardcoded in a wrappper for stat_... function). Can compute_group somehow access corresponding rows from the original dataset?
  2. Rownames for compute_group's data are unique per PANEL, which means they do not reflect the actual rownames of original dataset supplited to ggplot, and we can no longer exclude e.g. incomplete cases omitted by full model, unless explicitly specifying/hardcoding some id variable. The question is same as above.
  3. Can a geom layer access the y's computed by another custom stat, which is also plotted? Consider a scatterplot of residuals with a smoother, such as. ggplot(data, aes(x = covariate, time = time, event = event)) + stat_funcform() + geom_smooth(aes(y = ..martingales..)), where ..martingales.. are actually computed by stat_funcform.

.

# Load & Set data ---------------------------------------------------------

require(data.table)
require(dplyr)
require(ggplot2)
require(survival)
set.seed(1)

dt <- data.table(factor = sample(c(0, 2), 1e3, T),
                 factor2 = sample(c(0, 2, NA), 1e3, T),
                 covariate = runif(1e3, 0, 1))
dt[, `:=`(etime = rexp(1e3, .01 * (.5 + factor + covariate)),
          ctime = runif(1e3, 0, 1e2))]
dt[, `:=`(event = ifelse(etime < ctime, 1, 0),
          time = ifelse(etime < ctime, etime, ctime))]
survfit(Surv(time, event) ~ factor + I(covariate > .5), data = dt) %>% 
  autoplot(fun = 'cloglog')


# Fit coxph model ---------------------------------------------------------

fit <- coxph(Surv(time, event) ~ factor + covariate, data = dt)

# Define custom stat ------------------------------------------------------

StatFuncform <- ggproto(
  "StatFuncform", Stat,
  compute_group = function(data, scales, params) {
    head(data) %>% print
    data$y <- update(params$fit, ~ 1, data = data) %>% 
      resid(type = 'martingale')

    return(data)
  }
)

stat_funcform <- function(mapping = NULL, data = NULL, geom = "point",
                          position = "identity", na.rm = FALSE, show.legend = NA, 
                          inherit.aes = TRUE, ...) {
  layer(
    stat = StatFuncform, data = data, mapping = mapping, geom = geom, 
    position = position, show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(na.rm = na.rm, ...)
  )
}


# Plot --------------------------------------------------------------------

# computation unsuccessfull - no 'time' and 'event' variables    
ggplot(data = dt, 
       mapping = aes(x = covariate, 
                     color = factor(event))) + 
  stat_funcform(params = list(fit = fit)) + 
  facet_grid(~ factor, labeller = label_both, margins = TRUE)

# ok - 'time' and 'event' specified explicitly    
ggplot(data = dt, 
       mapping = aes(x = covariate, time = time, event = event, 
                     factor2 = factor2,
                     color = factor(event))) + 
  stat_funcform(params = list(fit = fit)) + 
  facet_grid(~ factor, labeller = label_both, margins = TRUE)

I'm aware there are already a few packages implementing diagnostics for survival analysis (e.g. survminer) or ggplot2's autoplot methods, but they're rather providing wrappers instead of taking advantage of default ggplot2() + stat_ semantics.

Mike Wise
  • 22,131
  • 8
  • 81
  • 104
mjktfw
  • 840
  • 6
  • 14
  • I didn't get any error message. I did get a warning: "Computation failed in `stat_funcform()`: Time variable is not numeric" – IRTFM Mar 20 '17 at 22:16
  • I've edited the comment. This might be the way ggplot handles error in `stat`'s computations. Nevertheless, the plot is not produced, and message is probably a result of accessing `time` function, instead of variable, due to scoping order. – mjktfw Mar 20 '17 at 22:24
  • After a bit of research, as far as I'am concerned, there are no special/environmental variables, that would allow to access the original data and/or preserve the row mapping. Also, the data is computed per layer, arguments are defined pretty explicitly, and methods involved during the printing procedure are supplied with individual layers, instead of the complete `ggplot` object. – mjktfw Mar 26 '17 at 16:19
  • I think, that the whole difficulty arises from the fact, that data is expected to be precomputed as data.frame, when the `ggplot` object is created (`fortify`, etc.), while the whole `stat`, `facet`, `aesthetics` logic kicks in while printing the object. I guess what I need is to move the data creation/evaluation step towards the printing. Some possibilities are: 1) insert an additional method called between `ggplot()` and `print()`; 2) construct wrapper around `ggplot.print`, 3) modify `ggplot2:::Layer$compute_layer()` function to take the whole plot as a parameter – mjktfw Mar 26 '17 at 16:26

0 Answers0