5

I am working on a model that includes several random effects (REs) and a spline for one of the variables, so I am trying to use gam(). However, I hit a vector-memory-exhausted error, even when I run it on a cluster with 128 GB of RAM, and even for the simplest model with just one RE. The same models (minus the spline) run smoothly in just a few seconds (or minutes for the full model) when I use lmer() instead.

I was wondering if anyone had any idea why this discrepancy between gam() and lmer() arises, and whether there are any potential solutions.

Here's some code with simulated data and the simplest of models:

library(mgcv)
library(lme4)

set.seed(1234) 
person_n <- 38000 # number of people (grouping variable)
n_j <- 15 # number of data points per person 
B1 <- 3 # beta for the main predictor
n <- person_n * n_j 

person_id <- gl(person_n, k = n_j) # creating the grouping variable
person_RE <- rep(rnorm(person_n), each = n_j) # creating the random effects (per-person intercepts)

x <- rnorm(n) # creating x as a normal dist centered at 0 and sd = 1
error <- rnorm(n) 

# putting it all together
y <- B1 * x + person_RE + error
dat <- data.frame(y, person_id, x)

m1 <- lmer(y ~ x + (1 | person_id), data = dat)

g1 <- gam(y ~ x + s(person_id, bs = "re"), method = "REML", data = dat)

m1 runs in just a couple of seconds on my computer, whereas g1 hits this error:

Error: vector memory exhausted (limit reached?)

asked by mpnyka (edited by ZygD)

1 Answer

5

From ?mgcv::random.effects:

gam can be slow for fitting models with large numbers of random effects, because it does not exploit the sparsity that is often a feature of parametric random effects ... However ‘gam’ is often faster and more reliable than ‘gamm’ or ‘gamm4’, when the number of random effects is modest. [emphasis added]

What this means is that, in the course of setting up the model, s(., bs = "re") generates a dense model matrix equivalent to model.matrix(~ person_id - 1). With nrows = 5.7e5 observations and nlevels = 3.8e4 levels of person_id, that matrix takes nrows × nlevels × 8 bytes/double = (5.7e5 * 3.8e4 * 8)/2^30 ≈ 161.4 GiB (which is exactly the object size that my machine reports it can't allocate).
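That figure can be checked with a quick back-of-the-envelope calculation using the dimensions from the question's simulation:

```r
# Size of the dense indicator matrix that gam() materializes
# for s(person_id, bs = "re"), using the question's dimensions:
nrows   <- 38000 * 15              # 570,000 observations
nlevels <- 38000                   # levels of person_id
gib <- nrows * nlevels * 8 / 2^30  # 8 bytes per double, in GiB
round(gib, 1)                      # ~161.4 GiB
```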

Check out mgcv::gamm and gamm4::gamm4 for more memory-efficient (and faster, in this case) methods ...
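As a minimal sketch of the gamm() route (same variable names as the question, but a smaller simulated dataset so it runs quickly): gamm() hands the random effect to nlme::lme, so no dense indicator matrix is ever built.

```r
library(mgcv)  # gamm() fits the random-effects part via nlme::lme

set.seed(1234)
person_n <- 200  # smaller than the question's 38,000, for a quick demo
n_j <- 15
dat <- data.frame(
  person_id = gl(person_n, k = n_j),
  x         = rnorm(person_n * n_j)
)
dat$y <- 3 * dat$x + rep(rnorm(person_n), each = n_j) + rnorm(nrow(dat))

# random effect specified lme-style instead of s(person_id, bs = "re")
g2 <- gamm(y ~ x, random = list(person_id = ~ 1), data = dat)
summary(g2$gam)  # fixed-effect/smooth part; g2$lme holds the RE fit
```

The fitted object is a list with a $gam component (for the smooth/fixed part) and an $lme component (for the random-effects structure), so the two pieces are inspected separately.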

Ben Bolker
  • Hey @ben, how did you view the object size that your machine could not allocate? I looked for memory monitoring tools for R but nothing seemed to provide that information – Mike Lavender Mar 10 '22 at 21:47