
I have a large dataset (>5 million rows) that I'd like to fit generalized additive mixed models to. I've tried fitting this with both gam() and gamm() in the mgcv package, with the following specifications:

m1 <- gamm(y~s(time, by=category, k=-1) + category, random=list(model=~1), family=betar,  data=df)
m2 <- gam(y~s(time, by=category, k=-1) + category + s(model, bs='re'), family=betar, data=df)

I'm interested in fitting GAM models to account for non-linear relationships between time and y, and to quantify differences between categories. Model is included as a random effect, as each of the 5 models contributes observations to every category (see the dummy data below).

I've subset the original data (~6 million rows) to a smaller data frame (300k rows) for experimenting, but even on a machine with 128GB of RAM I'm getting issues:

nlminb problem convergence error code = 1 message = iteration limit reached without convergence (10)

and:

Error: cannot allocate vector of size 32.6 Gb

My question is two-fold:

  1. are GAMMs with datasets of this size computationally tractable, and

  2. is there anything I can do to optimize the code to make these run?

I've tried adding niterPQL=500 to the gamm() call, but the convergence issues persist. Ultimately I'd like to include spatial autocorrelation with corSpatial(form = ~ lat + long) in the gamm() model, or s(lat, long) in the gam() model, but even in this basic form I can't get the models to run.
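For the "anything I can do to optimize" part, mgcv's bam() is designed specifically for large datasets: method = "fREML" plus discrete = TRUE discretises the covariates, which typically cuts memory and fit time dramatically compared to gam(). A minimal sketch of how the m2 specification could be refit this way is below, using a small simulated stand-in (`dat`) for the real data frame; the k value, thread count, and betar support in bam (which needs a reasonably recent mgcv) are assumptions to adjust for your setup.

```r
library(mgcv)

# Simulated stand-in shaped like the real data: 4 categories x 5 models,
# response strictly inside (0, 1) as betar requires.
set.seed(1)
n <- 2000
dat <- data.frame(time     = rep(seq(n / 4), 4),
                  model    = factor(rep(letters[1:5], length.out = n)),
                  category = factor(rep(letters[1:4], each = n / 4)))
dat$y <- plogis(0.002 * dat$time + rnorm(n, sd = 0.5))

# bam() with fast REML and covariate discretisation; nthreads parallelises
# parts of the fit. Swap `dat` for the full 6M-row data frame.
m2b <- bam(y ~ s(time, by = category, k = 20) + category + s(model, bs = "re"),
           family = betar(link = "logit"), data = dat,
           method = "fREML", discrete = TRUE, nthreads = 2)
```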

If it helps to understand the structure of the data, I've added dummy code below (200,000 rows):


library(mgcv)

beta <- 0.0002
n <- 10000

# 5 models (a-e) x 4 categories (a-d), n rows each: 200,000 rows total
df <- do.call(rbind, lapply(letters[1:4], function(cat) {
  do.call(rbind, lapply(letters[1:5], function(mod) {
    data.frame(y = exp(beta * seq(n)) + rnorm(n),
               time = seq(n), model = mod, category = cat)
  }))
}))

df$model <- as.factor(df$model)
df$category <- as.factor(df$category)

# rescale to [0,1] first...
range01 <- function(x){(x - min(x)) / (max(x) - min(x))}
df$y <- range01(df$y)

# ...then nudge the exact 0s and 1s off the boundary,
# since betar requires responses strictly inside (0,1)
df$y[df$y == 1] <- 0.999999
df$y[df$y == 0] <- 0.000001
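As an aside, a common alternative to hard-clamping the boundary values for beta regression is the compression transform of Smithson & Verkuilen (2006), which shrinks the whole response slightly toward 0.5 so that no value sits exactly at 0 or 1. A sketch (the function name is mine):

```r
# Compress y away from the boundaries: maps [0,1] into (0,1).
# For large n the shift is tiny; 0.5 is left unchanged.
squeeze <- function(y, n = length(y)) (y * (n - 1) + 0.5) / n
# df$y <- squeeze(df$y)   # could be used instead of the two replacements above
```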


m1 <- gamm(y~s(time, by=category, k=-1) + category, random=list(model=~1), family=betar,  data=df)
m2 <- gam(y~s(time, by=category, k=-1) + category + s(model, bs='re'), family=betar, data=df)
Thomas Moore
    [`bam`](https://stat.ethz.ch/R-manual/R-patched/library/mgcv/html/bam.html) might be worth a shot – user20650 Mar 29 '21 at 11:56
  • Wow. Can't believe I missed that one. I'll try that now in place of the m2 gam model, but I'd still like to get m1 gamm running for the mixed effects. – Thomas Moore Mar 29 '21 at 12:33

0 Answers