I'm fitting a Cox regression in R using survival::coxph() and the Greg::timeSplitter() function to split my data by time. The impression I got from the documentation is that time-splitting should, in itself, make little to no difference to the regression results. (Note that I'm not trying to model any time-dependent covariates or coefficients at this point; I'm simply splitting the data by time.)
This seems to be true for the main results of the regression itself (the coefficients output by coxph()). However, splitting sometimes (though not always) seems to make a large difference to the output of cox.zph(), the test of the proportional hazards assumption. Is this expected, and how should I interpret the results?
Edit: The behaviour also seems extremely sensitive to the data. If I change the example below by cutting the data down to fake_data[1:551,] after creating it, I get roughly identical outputs from cox.zph() for the regular and split models. But I get very different outputs if I cut the data down to fake_data[1:550,].
Code reproducing the phenomenon:
library(tidyverse)  # attaches dplyr and tibble, so no separate library(dplyr) is needed
library(survival)   # coxph(), cox.zph()
library(Greg)       # timeSplitter()
set.seed(56891)
fake_data <- tibble("id" = c(1:5000))
n <- nrow(fake_data)
fake_data$gender <- if_else(runif(n) > 0.5, "M", "F")
fake_data$event_status <- if_else(runif(n) > 0.3, 1, 0)
fake_data$time_to_event <- round(500*runif(n),3)
# Model on the original data, one row per subject
regular_model <- coxph(Surv(time_to_event, event_status == 1) ~ gender,
                       data = fake_data,
                       x = TRUE, y = TRUE)
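# Split each subject's follow-up into intervals of 5 time units
spl_data <-
  fake_data %>%
  timeSplitter(by = 5,
               event_var = "event_status",
               event_start_status = 0,
               time_var = "time_to_event")
# Same model on the split data, in (start, stop] counting-process form
split_model <- coxph(Surv(Start_time, Stop_time, event_status) ~ gender,
                     data = spl_data,
                     x = TRUE, y = TRUE)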
regular_model
split_model
cox.zph(regular_model)
cox.zph(split_model)
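
As a cross-check that does not depend on Greg, the same comparison can be sketched with survival::survSplit(), which is a different splitting function from the one used above and may not behave identically. This is only a minimal sketch under assumptions: the names d, d_split, fit_plain and fit_split are illustrative, event times are left unrounded to avoid ties, and cut points every 5 time units are assumed to mirror by = 5. It checks that the coefficients agree and prints the cox.zph() tables under both the default "km" time transform and the "identity" transform, so they can be compared side by side.

```r
library(survival)

set.seed(56891)
n <- 5000
d <- data.frame(
  gender = ifelse(runif(n) > 0.5, "M", "F"),
  status = as.integer(runif(n) > 0.3),
  time   = 500 * runif(n)  # assumption: unrounded, so no tied event times
)

# Split follow-up at 5, 10, ..., 495; survSplit() adds a tstart column
# and keeps `time` as the end of each interval.
d_split <- survSplit(Surv(time, status) ~ ., data = d,
                     cut = seq(5, 495, by = 5))

fit_plain <- coxph(Surv(time, status) ~ gender, data = d,
                   x = TRUE, y = TRUE)
fit_split <- coxph(Surv(tstart, time, status) ~ gender, data = d_split,
                   x = TRUE, y = TRUE)

# Coefficients from the two fits, side by side
print(cbind(plain = coef(fit_plain), split = coef(fit_split)))

# Proportional hazards test under two time transforms
zph_km_plain <- cox.zph(fit_plain, transform = "km")$table
zph_km_split <- cox.zph(fit_split, transform = "km")$table
zph_id_plain <- cox.zph(fit_plain, transform = "identity")$table
zph_id_split <- cox.zph(fit_split, transform = "identity")$table

print(zph_km_plain)
print(zph_km_split)
print(zph_id_plain)
print(zph_id_split)
```

If the two "identity" tables agree while the two "km" tables do not, that would suggest the discrepancy comes from how the time transform is computed on split data rather than from the fitted model itself.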