Backward Elimination for Cox Regression

Question

I want to explore the following variables and their 2-way interactions as possible predictors: the number of siblings (nsibs), weaning age (wmonth), maternal age (mthage), race, poverty, birthweight (bweight) and maternal smoking (smoke).

I created my Cox regression formula but I don't know how to form the 2-way interaction with the predictors:

coxph(Surv(wmonth,chldage1)~as.factor(nsibs)+mthage+race+poverty+bweight+smoke,data=pneumon)


final<-step(coxph(Surv(wmonth,chldage1)~(as.factor(nsibs)+mthage+race+poverty+bweight+smoke)^2,data=pneumon),direction='backward')

Study `help("formula")`. If I understand you correctly, you want to wrap the RHS in `(...)^2`. — Roland, May 18 '17 at 08:48
You have now completely changed the question. Stepwise regression is a [fundamentally flawed approach](https://stats.stackexchange.com/a/20856/11849). Can't help you with that. — Roland, May 18 '17 at 09:10

IRTFM · Answer 1 · 2017-05-19T16:47:57.260

The formula interface is the same for coxph as it is for lm or glm. If you need to form all the two-way interactions, you use the ^-operator with a first argument of the "sum" of the covariates and a second argument of 2:

coxph(Surv(wmonth,chldage1) ~ 
             ( as.factor(nsibs)+mthage+race+poverty+bweight+smoke)^2, 
      data=pneumon)
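If you want to see what `^2` expands to before fitting anything, you can ask `terms()` for the term labels. This is only a sketch in base R: it parses the formula symbolically, so it needs neither the data nor the survival package.

```r
# Expand the formula symbolically; nothing is evaluated, so no data are needed.
f <- Surv(wmonth, chldage1) ~
  (as.factor(nsibs) + mthage + race + poverty + bweight + smoke)^2

labs <- attr(terms(f), "term.labels")
labs          # the 6 main effects, followed by the two-way interactions
length(labs)  # 6 + choose(6, 2) terms
```

Each label containing a `:` is one of the pairwise interactions. Higher-order terms are not generated because the exponent is 2.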

I do not think there is a dedicated stepdown function for Cox regression. Therneau has spoken out in the past against making the process easy to automate. As Roland notes in his comment, the prevailing opinion among the R Core package authors is that stepwise procedures are statistically suspect. (This often creates some culture shock when people cross over to R from SPSS or SAS, where the culture is more accepting of stepwise procedures and where social science stats courses seem to endorse the method.)

First off, you need to address whether your data have enough events to support such a complex model. The statistical power of Cox models is driven by the number of events, not the number of subjects at risk. An admittedly imperfect rule of thumb is that you need 10-15 events for each covariate, and by adding interactions you expand the number of terms perhaps 10-fold, which expands the required number of events by a similar factor.
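As a back-of-the-envelope version of that rule of thumb, here is a sketch of the arithmetic for the formula in the question. The variable names `n_main`, `n_terms` and `events_needed` are just illustrative, and the count is a lower bound on the number of coefficients.

```r
# Rough events-per-variable check (rule of thumb only).
n_main  <- 6                           # covariates in the formula above
n_terms <- n_main + choose(n_main, 2)  # main effects + all two-way interactions

# A factor with k levels contributes k - 1 coefficients, so the number of
# estimated coefficients is at least n_terms and is often much larger.
events_needed <- n_terms * c(low = 10, high = 15)
events_needed
```

Once a model has been fitted, the actual figures can be read from the fit itself: `length(coef(fit))` gives the number of coefficients and `fit$nevent` the number of events, where `fit` stands for whatever `coxph` object you have.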

Harrell has discussed such matters in his RMS book and rms-package documentation and advocates applying shrinkage to the covariate estimates in the process of any selection method. That would be a more statistically principled route to follow.
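To give a feel for what shrinkage looks like in a Cox model, here is a minimal sketch using the `ridge()` penalty term from the survival package. This is not Harrell's procedure from the RMS book, only one simple form of penalization, and the `ovarian` data and the choice `theta = 1` are taken from the package's own example rather than from your problem.

```r
library(survival)

# Unpenalized fit, for comparison (illustrative data shipped with survival).
fit_plain <- coxph(Surv(futime, fustat) ~ rx + age + ecog.ps, data = ovarian)

# Same covariates, but age and ecog.ps get a ridge (L2) penalty.
# theta controls the amount of shrinkage; theta = 1 is an arbitrary choice here.
fit_ridge <- coxph(Surv(futime, fustat) ~ rx + ridge(age, ecog.ps, theta = 1),
                   data = ovarian)

# Compare the two sets of coefficient estimates.
coef(fit_plain)
coef(fit_ridge)
```

In practice the penalty would be chosen by cross-validation or a similar criterion rather than fixed in advance.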

If you do have such a large dataset and there is no theory in your domain of investigation regarding which covariate interactions are more likely to be important, an alternative would be to examine the full interaction model and then proceed with the perspective that each modification of your model adds to the number of degrees of freedom for the overall process. I have faced such a situation in the past (thousands of events, millions at risk), and my approach was to keep only the interactions that met a more stringent statistical criterion. I restricted this approach to groups of variables that were considered related, and I examined them first for their 2-way correlations. With no categorical variables in my model except smoking and gender, plus 5 continuous covariates, I kept 2-way interactions that had delta-deviance measures (distributed as chi-square statistics) of 30 or more. I was thereby retaining interactions that "achieved significance" where the implicit degrees of freedom were much higher than the naive software listings. I also compared the results for the retained covariate interactions with and without the removed interactions to make sure that the process had not meaningfully shifted the magnitudes of the predicted effects. I also used the validation and calibration procedures in Harrell's rms package.
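The delta-deviance comparison for a single interaction can be sketched as follows. The `lung` data from the survival package and the `age` by `sex` interaction are stand-ins chosen only so the code runs, and the cutoff of 30 is the one described above, not a general recommendation.

```r
library(survival)

# Illustrative data only: 'lung' ships with the survival package.
fit_main <- coxph(Surv(time, status) ~ age + sex, data = lung)
fit_int  <- coxph(Surv(time, status) ~ age * sex, data = lung)

# Delta-deviance = 2 * difference in maximized partial log-likelihoods.
# loglik[2] is the log-likelihood at the fitted coefficients.
delta_dev <- 2 * (fit_int$loglik[2] - fit_main$loglik[2])

# Stringent retention rule: keep the interaction only if delta-deviance >= 30.
keep_interaction <- delta_dev >= 30

# The same comparison reported as a likelihood-ratio test.
anova(fit_main, fit_int)
```

Both models must be fitted to the same rows for the comparison to be valid, which is why covariates without missing values were used here.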
