
I am writing a function to fit many glm models. To give you some idea of the function, I have included a small section of my code below. With the help of several SO users, the function now works for my analysis purposes. However, sometimes, particularly when the sample size is relatively small, it can take quite a long time to finish the whole process. To reduce the time, I am considering changing some details of the iterative maximization, such as the maximum number of iterations. I have not found a way to do this, perhaps because I am still not familiar with R terminology. Any suggestions for doing this, or other ways to reduce the run time, would be appreciated.

all_glm <- function(crude, xlist, data, family = "binomial", ...) {
  n <- length(xlist)  # number of candidate predictors
  # comb_lst: all non-empty combinations of the predictors in xlist
  comb_lst <- unlist(lapply(1:n, function(x) combn(xlist, x, simplify = FALSE)), recursive = FALSE)
  # md_lst: formulas for the many models to be fitted
  md_lst  <- lapply(comb_lst, function(x) paste(crude, "+", paste(x, collapse = "+")))
  models  <- lapply(md_lst, function(x) glm(as.formula(x), family = family, data = data))
  # OR: exponentiated estimate of the first predictor in each model
  OR      <- unlist(lapply(models, function(x) broom::tidy(x, exponentiate = TRUE)$estimate[2]))
  OR
}
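For what it's worth, base `glm` already lets you cap the number of IRLS iterations and loosen the convergence tolerance through its `control` argument. This is standard `glm`/`glm.control` API; the formula, data set, and values below are illustrative only:

```r
# Limit glm to at most 10 IRLS iterations with a looser tolerance.
# maxit and epsilon are passed via glm.control(); the defaults are 25 and 1e-8.
fit <- glm(am ~ mpg + wt, data = mtcars, family = binomial,
           control = glm.control(maxit = 10, epsilon = 1e-6))
fit$iter       # number of IRLS iterations actually used
fit$converged  # whether the fit converged within maxit
```

Loosening `epsilon` trades a small amount of coefficient accuracy for fewer iterations, which can add up when fitting thousands of candidate models.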

EDIT Thanks to @BenBolker, who directed me to the package fastglm, I have ended up with several R packages that provide faster alternatives to glm. I have tried fastglm and speedglm. It appears that both are faster than glm on my machine.

library(fastglm)
library(speedglm)
# from 
set.seed(1)
n <- 25000
k <- 500
y <- rbinom(n, size = 1, prob = 0.5)
x <- round(matrix(rnorm(n * k), n, k), digits = 3)
colnames(x) <- paste("s", 1:k, sep = "")
df <- data.frame(y, x)
fo <- as.formula(paste("y~", paste(paste("s", 1:k, sep = ""), collapse = "+")))

# Fit three models: 
system.time(m_glm <- glm(fo, data=df, family = binomial))
system.time(m_speedglm <- speedglm(fo, data= df, family = binomial()))
system.time(m_fastglm <- fastglm(x, y, family = binomial()))

> system.time(m_glm <- glm(fo, data=df, family = binomial))
   user  system elapsed 
  56.51    0.22   58.73 
> system.time(m_speedglm <- speedglm(fo, data= df, family = binomial()))
   user  system elapsed 
  17.28    0.04   17.55 
> system.time(m_fastglm <- fastglm(x, y, family = binomial()))
   user  system elapsed 
  23.87    0.09   24.12 
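As a quick sanity check (my addition, not part of the original benchmark), the coefficient estimates from the different backends should agree to several decimal places. One wrinkle: the `fastglm(x, y, ...)` call above fits without an intercept, since `x` has no intercept column, so refit with an explicit intercept column for a like-for-like comparison. This sketch assumes the usual `coef()` accessors for `glm`, `speedglm`, and `fastglm` objects:

```r
# Refit fastglm with an intercept column so the coefficient vectors
# have the same length as the formula-based fits.
m_fastglm2 <- fastglm(cbind(`(Intercept)` = 1, x), y, family = binomial())

# Compare unnamed numeric vectors; the tolerance allows for small
# differences between solvers.
all.equal(unname(coef(m_glm)), unname(coef(m_speedglm)),  tolerance = 1e-4)
all.equal(unname(coef(m_glm)), unname(coef(m_fastglm2)), tolerance = 1e-4)
```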
    have you tried the [fastglm](https://CRAN.R-project.org/package=fastglm) package? – Ben Bolker Oct 17 '19 at 23:39
  • @BenBolker Thanks. No, I haven't, I will give a try to see if it makes a difference. – Zhiqiang Wang Oct 17 '19 at 23:46
  • @BenBolker I have tried `fastglm`. It allows users to set the convergence tolerance and the maximum number of iterations, but it requires `x` to be a matrix object, which may not be convenient for some end users. – Zhiqiang Wang Oct 18 '19 at 00:33
  • are your predictors all numeric, or are some factors (categorical)? A [mcve] would be great. What are the typical dimensions of your problem (number of observations, number of predictors)? Have you considered a penalized (lasso/ridge) approach? – Ben Bolker Oct 18 '19 at 00:47
  • Mixed with both `numeric` and `factor` variables. I would like to make the function generic for different types of predictors. I will try to work out a reproducible example. Thanks. – Zhiqiang Wang Oct 18 '19 at 00:57
  • This is the tiniest of notes. I am unaware of a faster implementation in R, although with a bit of trickery possibly `openGL` or other fully `c++` based implementations might be slightly faster. As for the problem with `fastglm` only allowing numeric `x` and `y`, I suggest using the standard glm call `glm(fo, data = df, family = binomial, method = "fastglm")`. The call to `glm` will take care of converting your formula to a model matrix and give the necessary input for `fastglm`. **Note** that printing the output does take a long time if you use this method (for reasons unknown to me). – Oliver Oct 18 '19 at 18:00
  • @Oliver Great to know how `method = ` works! I cannot use `fastglm` in my function for another reason: cannot `broom::tidy()` a `fastglm` object. `speedglm` works fine. Many thanks! – Zhiqiang Wang Oct 19 '19 at 00:08
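The trick from @Oliver's comment, in runnable form. This is an untested sketch that relies on `glm` dispatching its fitting step via the `method` argument, exactly as described in the comment above:

```r
library(fastglm)

# glm builds the model matrix from the formula (handling factor predictors
# via the usual model.matrix step), then hands the fitting off to fastglm.
m <- glm(fo, data = df, family = binomial, method = "fastglm")
```

This keeps the convenient formula interface while avoiding the need to construct the matrix `x` by hand.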

1 Answer


The IRLS algorithm typically used for fitting GLMs requires a matrix inversion/decomposition at each iteration. fastglm offers several different options for the decomposition, and the default choice is a slower but more stable one (QR with column pivoting). If your only interest is speed, then one of the two available Cholesky-type decompositions will improve the speed dramatically, which is more advisable than just reducing the number of IRLS iterations. Another notable difference between fastglm and standard IRLS implementations is its careful use of half-steps to prevent divergence (IRLS can diverge in practice in a number of cases).

The method argument of fastglm allows one to change the decomposition: option 2 gives the vanilla Cholesky decomposition and option 3 gives a slightly more stable version of it. On my computer, the timings for your provided example are:

> system.time(m_glm <- glm(fo, data=df, family = binomial))
   user  system elapsed 
 23.206   0.429  23.689 

> system.time(m_speedglm <- speedglm(fo, data= df, family = binomial()))
   user  system elapsed 
 15.448   0.283  15.756 

> system.time(m_fastglm <- fastglm(x, y, family = binomial(), method = 2))
   user  system elapsed 
  2.159   0.055   2.218 

> system.time(m_fastglm <- fastglm(x, y, family = binomial(), method = 3))
   user  system elapsed 
  2.247   0.065   2.337 

With regards to using broom with fastglm objects, I can look into that.

Another note about decompositions: when fastglm uses the QR decomposition, it works with the design matrix directly. Although speedglm technically offers a QR decomposition, it works by first computing $X^TX$ and decomposing that, which is less numerically stable than a QR decomposition of $X$ itself.
