I would like to implement diagnostics for Cox proportional hazards model with ggplot2
, by creating new stats
functions and ggproto
objects. The idea is to benefit from grouping (by color
, facet_grid
, etc.) for conditional computation of desired statistics (e.g. martingale residuals).
In the example below, the null model is refitted and martingale's computed for each cofactor level. The issues are:
- Fit is done based on
data
argument provided internally for theggproto
'scompute_group
method, not the actual dataset. This means, that dataframe is stripped of columns not explicitly defined inaes
so it's up to the user to provide at least "time to event" and "event indicator" columns each time. (unless they're hardcoded in a wrappper forstat_...
function). Cancompute_group
somehow access corresponding rows from the original dataset? - Rownames for
compute_group
'sdata
are unique perPANEL
, which means they do not reflect the actual rownames of original dataset supplited toggplot
, and we can no longer exclude e.g. incomplete cases omitted by full model, unless explicitly specifying/hardcoding some id variable. The question is same as above. - Can a
geom
layer access they
's computed by another custom stat, which is also plotted? Consider a scatterplot of residuals with a smoother, such as.ggplot(data, aes(x = covariate, time = time, event = event)) + stat_funcform() + geom_smooth(aes(y = ..martingales..))
, where..martingales..
are actually computed bystat_funcform
.
.
# Load & Set data ---------------------------------------------------------
require(data.table)
require(dplyr)
require(ggplot2)
require(survival)
set.seed(1)
dt <- data.table(factor = sample(c(0, 2), 1e3, T),
factor2 = sample(c(0, 2, NA), 1e3, T),
covariate = runif(1e3, 0, 1))
dt[, `:=`(etime = rexp(1e3, .01 * (.5 + factor + covariate)),
ctime = runif(1e3, 0, 1e2))]
dt[, `:=`(event = ifelse(etime < ctime, 1, 0),
time = ifelse(etime < ctime, etime, ctime))]
survfit(Surv(time, event) ~ factor + I(covariate > .5), data = dt) %>%
autoplot(fun = 'cloglog')
# Fit coxph model ---------------------------------------------------------
fit <- coxph(Surv(time, event) ~ factor + covariate, data = dt)
# Define custom stat ------------------------------------------------------
StatFuncform <- ggproto(
"StatFuncform", Stat,
compute_group = function(data, scales, params) {
head(data) %>% print
data$y <- update(params$fit, ~ 1, data = data) %>%
resid(type = 'martingale')
return(data)
}
)
stat_funcform <- function(mapping = NULL, data = NULL, geom = "point",
position = "identity", na.rm = FALSE, show.legend = NA,
inherit.aes = TRUE, ...) {
layer(
stat = StatFuncform, data = data, mapping = mapping, geom = geom,
position = position, show.legend = show.legend, inherit.aes = inherit.aes,
params = list(na.rm = na.rm, ...)
)
}
# Plot --------------------------------------------------------------------
# computation unsuccessfull - no 'time' and 'event' variables
ggplot(data = dt,
mapping = aes(x = covariate,
color = factor(event))) +
stat_funcform(params = list(fit = fit)) +
facet_grid(~ factor, labeller = label_both, margins = TRUE)
# ok - 'time' and 'event' specified explicitly
ggplot(data = dt,
mapping = aes(x = covariate, time = time, event = event,
factor2 = factor2,
color = factor(event))) +
stat_funcform(params = list(fit = fit)) +
facet_grid(~ factor, labeller = label_both, margins = TRUE)
I'm aware there are already a few packages implementing diagnostics for survival analysis (e.g. survminer
) or ggplot2
's autoplot
methods, but they're rather providing wrappers instead of taking advantage of default ggplot2() + stat_
semantics.