
I am using RStudio 0.97.320 (R 2.15.3) on Amazon EC2. My data frame has 200k rows and 12 columns.

I am trying to fit a logistic regression with approximately 1500 parameters.

R is using only 7% of the CPU, the machine has 60+ GB of memory, and the fit is still taking a very long time.

Here is the code:

glm.1.2 <- glm(formula = Y ~ factor(X1) * log(X2) * (X3 + X4 * (X5 + I(X5^2)) * (X8 + I(X8^2)) + ((X6 + I(X6^2)) * factor(X7))), 
  family = binomial(logit), data = df[1:150000,])

Any suggestions to speed this up by a significant amount?

Will Beauchamp
  • I don't have an immediate suggestion on speed, but as far as inference goes you should not be using `var + I(var^2)`; use `poly(var, 2)` instead. You have constructed an incredibly complex formula, and it is not at all clear that you need such a monster. You should describe the research question and get further advice about the analysis design, probably over at CrossValidated. – IRTFM Apr 29 '13 at 17:40
  • I doubt that fitting 1500 parameters will give a useful result. – Roland Apr 29 '13 at 17:40
  • Interesting technical question, although I agree with the other commenters' concerns. (1) There is a `fastLm` function in the `RcppArmadillo` package that illustrates how to speed up linear regression (http://gallery.rcpp.org/articles/fast-linear-model-with-armadillo/), but re-implementing GLM would be more work. (2) Installing an optimized BLAS library might be lower-hanging fruit: http://www.r-bloggers.com/faster-r-through-better-blas/. (3) Linear regression might work OK, although N/P is only 133 in this case. (4) Try penalized GLM via the `glmnet` package ... – Ben Bolker Apr 29 '13 at 17:43
  • (5) Since some of your predictors are factors, you might buy some speed by using a sparse model matrix (see `?glm.fit` and `?sparse.model.matrix` in the `Matrix` package), especially if your factors have many levels. – Ben Bolker Apr 29 '13 at 17:56
  • Thanks Ben, factor(X1) has ~40 levels and factor(X7) has 3. Is this sparse enough for the Matrix package? – Will Beauchamp Apr 29 '13 at 18:17
  • You should seriously consider using `glmnet`: it's really fast (it uses gradient descent), and with 1500 parameters to fit I don't think that regularization (through the elastic net) would hurt. – dickoa Apr 29 '13 at 20:09
  • glmnet looks interesting, dickoa, but I am having trouble turning my variables and formula into a matrix that glmnet can use; any advice? – Will Beauchamp Apr 30 '13 at 13:48
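
For later readers, here is a minimal sketch of the glmnet route suggested by dickoa and Ben Bolker above: build a sparse model matrix from the question's formula with `sparse.model.matrix()` and hand it to `cv.glmnet()`. The data frame `df` and the column names are taken from the question; everything else is illustrative.

    # Sketch: sparse model matrix + penalized logistic regression via glmnet
    library(Matrix)
    library(glmnet)

    train <- df[1:150000, ]

    # Expand the formula into a sparse dgCMatrix; drop the intercept column
    # because glmnet fits its own intercept.
    X <- sparse.model.matrix(
      ~ factor(X1) * log(X2) * (X3 + X4 * (X5 + I(X5^2)) * (X8 + I(X8^2)) +
          (X6 + I(X6^2)) * factor(X7)),
      data = train)[, -1]
    y <- train$Y

    # Cross-validated elastic net; alpha = 1 is the lasso, alpha = 0 is ridge.
    fit <- cv.glmnet(X, y, family = "binomial", alpha = 1)
    coef(fit, s = "lambda.min")

`coef(fit, s = "lambda.min")` returns the coefficients at the penalty chosen by cross-validation; with the lasso many of them will be exactly zero, which is usually welcome when there are ~1500 candidate parameters.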

3 Answers


There are a couple of packages that speed up glm fitting. fastglm has benchmarks showing it to be even faster than speedglm.
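
As a rough, hedged sketch of what a fastglm fit could look like for the model in the question (argument names follow my reading of `?fastglm`; unlike `glm()`, the function takes a pre-built model matrix and a response vector rather than a formula):

    library(fastglm)

    train <- df[1:150000, ]
    X <- model.matrix(
      ~ factor(X1) * log(X2) * (X3 + X4 * (X5 + I(X5^2)) * (X8 + I(X8^2)) +
          (X6 + I(X6^2)) * factor(X7)),
      data = train)
    y <- train$Y

    # method picks the decomposition; the Cholesky-based options are usually
    # faster than the default pivoted QR, at some cost in numerical robustness.
    fit <- fastglm(X, y, family = binomial(), method = 2)
    coef(fit)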

You could also install an optimized BLAS library on your computer (as Ben Bolker suggests in the comments), which will help any of these methods.

Gregor Thomas

Although a bit late, I can only second dickoa's suggestion to generate a sparse model matrix using the Matrix package and then feed it to the speedglm.wfit function. That works great ;-) This way, I was able to run a logistic regression on a 1e6 x 3500 model matrix in less than 3 minutes.
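
A sketch of that approach for the model in the question, assuming the `speedglm.wfit(y, X, ...)` interface and its `sparse` argument work as described in `?speedglm.wfit`:

    library(Matrix)
    library(speedglm)

    train <- df[1:150000, ]

    # sparse.model.matrix keeps the dummy columns for the factors compact.
    X <- sparse.model.matrix(
      ~ factor(X1) * log(X2) * (X3 + X4 * (X5 + I(X5^2)) * (X8 + I(X8^2)) +
          (X6 + I(X6^2)) * factor(X7)),
      data = train)
    y <- train$Y

    # speedglm.wfit works on the response and model matrix directly
    # (there is no formula interface at this level).
    fit <- speedglm.wfit(y = y, X = X, family = binomial(logit), sparse = TRUE)

The sparse representation pays off here mainly because factor(X1) has ~40 levels, so most columns of the expanded design matrix are mostly zeros.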

Matt

If your design matrix is not sparse, you can also consider my package parglm. See this vignette for a comparison of computation times and further details; I also show a comparison of computation times on a related question.

One of the methods in the parglm function works like the bam function in mgcv: it computes QR decompositions on chunks of the data and combines the results at the end. The method is described in detail in

Wood, S.N., Goude, Y. & Shaw S. (2015) Generalized additive models for large datasets. Journal of the Royal Statistical Society, Series C 64(1): 139-155.

One advantage of the method is that one can use a non-concurrent QR implementation and still do the computation in parallel. Another advantage is a potentially lower memory footprint. mgcv's bam function exploits this, and it could also be implemented here with a setup like the one in speedglm's shglm function.
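
A sketch of the call for the model in the question, assuming the glm()-style interface and the parglm.control(nthreads = ...) argument described in the package documentation (check ?parglm and ?parglm.control for the exact options):

    library(parglm)

    # Same formula as in the question; nthreads controls how many threads are
    # used for the chunked QR decompositions (4 is an arbitrary choice here).
    fit <- parglm(
      Y ~ factor(X1) * log(X2) * (X3 + X4 * (X5 + I(X5^2)) * (X8 + I(X8^2)) +
          (X6 + I(X6^2)) * factor(X7)),
      family = binomial(logit),
      data = df[1:150000, ],
      control = parglm.control(nthreads = 4L))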

  • Can you add a tiny bit more context about what `parglm` does (i.e. how it achieves computational efficiency)? – Ben Bolker Nov 17 '18 at 20:10
  • I have done as in the `bam` function in `mgcv`: first a series of QR decompositions is estimated on different chunks of the data, and the results are then combined at the end. Further details can be found in "*Wood, S.N., Goude, Y. & Shaw S. (2015) Generalized additive models for large datasets. Journal of the Royal Statistical Society, Series C 64(1): 139-155*". – Benjamin Christoffersen Nov 17 '18 at 20:13
  • That's great. Can you edit this explanation into your answer (comments are ephemeral)? – Ben Bolker Nov 17 '18 at 20:14