Univariate cox regression analysis with multiple covariates

Question

covariates <- c("age", "sex",  "ph.karno", "ph.ecog", "wt.loss")
univ_formulas <- sapply(covariates,
                    function(x) as.formula(paste('Surv(time, status)~', x)))
                   
univ_models <- lapply( univ_formulas, function(x){coxph(x, data = lung)})
# Extract data 
univ_results <- lapply(univ_models,
                   function(x){ 
                      x <- summary(x)
                      p.value<-signif(x$wald["pvalue"], digits=2)
                      wald.test<-signif(x$wald["test"], digits=2)
                      beta<-signif(x$coef[1], digits=2);#coeficient beta
                      HR <-signif(x$coef[2], digits=2);#exp(beta)
                      HR.confint.lower <- signif(x$conf.int[,"lower .95"], 2)
                      HR.confint.upper <- signif(x$conf.int[,"upper .95"],2)
                      HR <- paste0(HR, " (", 
                                   HR.confint.lower, "-", HR.confint.upper, ")")
                      res<-c(beta, HR, wald.test, p.value)
                      names(res)<-c("beta", "HR (95% CI for HR)", "wald.test", 
                                    "p.value")
                      return(res)
                      #return(exp(cbind(coef(x),confint(x))))
                     })
res <- t(as.data.frame(univ_results, check.names = FALSE))
as.data.frame(res)

Normally I use this code for univariate cox regression analysis but I have multiple genes >20000 that I want to run as independent variables in a univariate cox regression analysis and I am not sure how I can run this code without typing the individual covariates (gene names) out. All my column names for the genes begin with "ENSG..".

Is there a way to do univariate cox regression on so many genes in an efficient way please? Thanks in advance.

score 0 · Answer 1 · answered Aug 30 '22 at 16:15

There are several ways to make the list of variable names without typing them out. Probably one of the easiest is to use names() to get all of the variable names in the data, then remove time and status from that list (as well as any other variables you don't want to include). For example, for the veteran dataset:

covariates <- names(survival::veteran)
covariates # Look at which names were detected
#> [1] "trt"      "celltype" "time"     "status"   "karno"    "diagtime" "age"     
#> [8] "prior"
covariates <- covariates[-which(covariates %in% c("time", "status"))]
covariates # Confirm time and status have been removed
#> [1] "trt"      "celltype" "karno"    "diagtime" "age"      "prior"

^{Created on 2022-08-30 by the reprex package (v2.0.1)}

You could also programmatically create a list of names. For example:

covariates <- sapply(1:10, FUN = function(x) paste0("ENSG.", x))
covariates
#>  [1] "ENSG.1"  "ENSG.2"  "ENSG.3"  "ENSG.4"  "ENSG.5"  "ENSG.6"  "ENSG.7" 
#>  [8] "ENSG.8"  "ENSG.9"  "ENSG.10"

This approach might be better if the naming is easy to program. If the gene names are irregular it might be more difficult.

As far as efficiency, I don't think much can be done to improve the overall runtime. The bulk of the runtime is doing the actually coxph() calculations. There are other questions/answers on the site about optimizing R code. If you want to pursue optimization I'd suggest looking through those, and then making a new question with a reproducible example if you need more help.

Univariate cox regression analysis with multiple covariates

1 Answers1