
I'm looking for suggestions on a strategy for fitting generalized linear mixed-effects models to a relatively large dataset.

Suppose I have data on 8 million US basketball passes from about 300 teams over 10 years. The data look something like this:

data <- data.frame(count = c(1, 1, 2, 1, 1, 5),
                   length_pass = c(1, 2, 5, 7, 1, 3),
                   year = c(1, 1, 1, 2, 2, 2),
                   mean_length_pass_team = c(15, 15, 9, 14, 14, 8),
                   team = c('A', 'A', 'B', 'A', 'A', 'B'))
data
 count length_pass year mean_length_pass_team team
1     1           1    1                    15    A
2     1           2    1                    15    A
3     2           5    1                     9    B
4     1           7    2                    14    A
5     1           1    2                    14    A
6     5           3    2                     8    B

I want to explain the number of steps a player takes before passing the ball (count). I have theoretical reasons to expect team-level differences in the relationship between count and length_pass, so a multilevel (i.e. mixed-effects) model seems appropriate.

My individual-level control variables are length_pass and year.

On the team level I have mean_length_pass_team; including the team mean should help me avoid ecological fallacies (Snijders, 2011).

I have been using the lme4 and brms packages to estimate these models, but fitting them takes days to weeks on my local 12-core, 128 GB machine.

library(lme4)
model_a <- glmer(count ~ length_pass + year + mean_length_pass_team + (1 | team),
                 data = data,
                 family = "poisson",
                 control = glmerControl(optCtrl = list(maxfun = 2e8)))

library(brms)
options(mc.cores = parallel::detectCores())
model_b <- brm(count ~ length_pass + year + mean_length_pass_team + (1 | team),
               data = data,
               family = "poisson")

I am looking for suggestions to speed up the fitting process, or for alternative techniques for fitting generalized linear mixed-effects models:

  • (How) Can I improve the speed on the lme4 and brms fits?
  • Are there other packages to consider?
  • Are there step-wise procedures that can help increase the speed of fitting models?
  • Are there interesting options outside the R environment that can help me fit this?

Any pointers are much appreciated!

  • Maybe [this](https://www.rdocumentation.org/packages/biglm/versions/0.9-1/topics/bigglm) can help. – F. Privé Oct 23 '17 at 15:49
  • @F.Privé it looks like the `biglm` package doesn't accept a multilevel formula - that is, the | is problematic. But thanks for your thoughts! – wake_wake Oct 23 '17 at 18:04
  • might not help much but https://stackoverflow.com/questions/44677487/lme4glmer-vs-statas-melogit-command/44728498#44728498 suggests `nAGQ=0` for speed up or try Julia – user20650 Oct 23 '17 at 18:12
  • Maybe you could try Stan with Automated Variational Inference? I tried it about a year ago and it seemed a bit buggy, but I'm sure they've made improvements since then. – RobertMyles Oct 23 '17 at 19:48
  • @RobertMc This means that the models are not being fit with MCMC sampling, right? Is that much faster? – wake_wake Oct 24 '17 at 07:14
  • It's faster, much faster (because it's not MCMC). Not sure if I trust it 100% though, but you could give it a shot. [Paper here](https://arxiv.org/abs/1506.03431) – RobertMyles Oct 24 '17 at 13:07
  • @user20650 The `nAGQ = 0` command helps to speed up significantly. Julia also seems a good option! Thanks – wake_wake Oct 25 '17 at 12:51
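
A minimal sketch of the two speed-ups mentioned in the comments, assuming the same data and model formula as above (`nAGQ = 0` trades a little accuracy for speed in lme4; `algorithm = "meanfield"` switches brms from MCMC sampling to variational inference):

library(lme4)
library(brms)

# nAGQ = 0: estimate the fixed effects inside the penalized least-squares
# step instead of the outer optimization; faster, somewhat less accurate
model_a_fast <- glmer(count ~ length_pass + year + mean_length_pass_team + (1 | team),
                      data = data,
                      family = "poisson",
                      nAGQ = 0)

# Mean-field ADVI instead of MCMC: much faster, but the approximation
# should be checked (e.g. against a short MCMC run on a subsample)
model_b_vb <- brm(count ~ length_pass + year + mean_length_pass_team + (1 | team),
                  data = data,
                  family = "poisson",
                  algorithm = "meanfield")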

2 Answers


I have found the MCMCglmm package to be much faster than brms for the models that MCMCglmm can handle (though brms can fit some models that MCMCglmm cannot).

You may need to toy around with the syntax, but it would be something like this:

    library(MCMCglmm)
    model_c <- MCMCglmm(fixed = count ~ length_pass + year + mean_length_pass_team,
                        random = ~ team,
                        family = "poisson",
                        data = data)

A downside is that I have found it hard to find online code examples that are tied to an explicit mathematical formulation of the model, so it can be difficult to judge whether you are fitting the model you intend to fit. However, your model seems simple enough.
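
One way to check that the call above corresponds to the model you intend is to fit both packages on a random subsample and compare the fixed-effect estimates; a minimal sketch, assuming the formula from the question:

    library(MCMCglmm)
    library(lme4)

    # Fit both packages on a random subsample and compare fixed effects
    set.seed(1)
    sub <- data[sample(nrow(data), min(nrow(data), 1e5)), ]

    fit_mcmc <- MCMCglmm(fixed = count ~ length_pass + year + mean_length_pass_team,
                         random = ~ team,
                         family = "poisson",
                         data = sub)
    fit_lme4 <- glmer(count ~ length_pass + year + mean_length_pass_team + (1 | team),
                      data = sub, family = "poisson", nAGQ = 0)

    summary(fit_mcmc)$solutions  # posterior means and CIs for the fixed effects
    fixef(fit_lme4)              # point estimates from glmer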


For general speed improvements, I would suggest linking R against OpenBLAS instead of the reference BLAS. Unfortunately, I don't believe lme4 relies on BLAS.

However, if you have several candidate models, I can also suggest fitting the lme4 models in parallel, which effectively divides your total wait time by the number of models you can run at once; see the sketch below.
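
A minimal sketch, assuming two hypothetical candidate formulas (placeholders only) and using parallel::mclapply, which forks and therefore works on Linux/macOS (use parLapply on Windows):

    library(parallel)
    library(lme4)

    # Placeholder candidate formulas; substitute the models you actually need
    formulas <- list(
      full    = count ~ length_pass + year + mean_length_pass_team + (1 | team),
      reduced = count ~ length_pass + year + (1 | team)
    )

    # One forked worker per model
    fits <- mclapply(formulas,
                     function(f) glmer(f, data = data, family = "poisson", nAGQ = 0),
                     mc.cores = min(length(formulas), detectCores()))

    lapply(fits, fixef)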

  • indeed lme4 does not use BLAS, because it needs some fancy linear algebra that lives in the `Eigen` package ... – Ben Bolker Dec 30 '17 at 23:56