I have a large dataset that I'd like to fit generalized additive mixed models (GAMMs) to. By large, I mean >5 million rows of data. I've tried fitting this with both gam() and gamm() in the mgcv package, using the following specifications:
m1 <- gamm(y~s(time, by=category, k=-1) + category, random=list(model=~1), family=betar, data=df)
m2 <- gam(y~s(time, by=category, k=-1) + category + s(model, bs='re'), family=betar, data=df)
I'm fitting GAMs to account for non-linear relationships between time and y, and to quantify differences between categories. model is included as a random effect, as each model has 5 different categories (dummy data below).
I've subset the original data (~6 million rows) to a smaller data frame (300k rows) for experimenting, but even on a machine with 128 GB of RAM I run into problems:
nlminb problem convergence error code = 1 message = iteration limit reached without convergence (10)
and:
Error: cannot allocate vector of size 32.6 Gb
My question is two-fold:
1. Are GAMMs computationally tractable with datasets of this size?
2. Is there anything I can do to optimize the code to make these models run?
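For reference, a bam() version of the second model would presumably look like the sketch below, since bam() is mgcv's function intended for large datasets. I haven't verified whether betar is accepted by bam() in my mgcv version, so treat this as untested:

m2_bam <- bam(y ~ s(time, by = category, k = -1) + category + s(model, bs = "re"),
              family = betar, data = df,
              discrete = TRUE,  # discretise covariates to cut memory use
              nthreads = 4)     # use multiple threads if cores are available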
I've tried adding niterPQL=500 to the gamm() call, but I still get convergence issues. Ultimately I'd like to include spatial autocorrelation via corSpatial(form = ~ lat + long) in the gamm() model, or s(lat, long) in the gam() model, but I can't get the models to run even in this basic form.
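For concreteness, the spatial versions I have in mind would look something like the sketch below; it assumes the real data frame also has numeric lat and long columns (the dummy data below does not include them):

m1_sp <- gamm(y ~ s(time, by = category, k = -1) + category,
              random = list(model = ~1),
              correlation = corSpatial(form = ~ lat + long),  # corSpatial() comes from nlme, which mgcv attaches
              family = betar, data = df)

m2_sp <- gam(y ~ s(time, by = category, k = -1) + category +
               s(model, bs = "re") + s(lat, long),  # 2-D spatial smooth
             family = betar, data = df)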
If it helps to understand the structure of the data, I've added dummy code below (200,000 rows):
library(mgcv)
beta <- 0.0002
n <- 10000
# one block of n observations for every model (a-e) x category (a-d) combination
combos <- expand.grid(model = letters[1:5], category = letters[1:4],
                      stringsAsFactors = FALSE)
df <- do.call(rbind, lapply(seq_len(nrow(combos)), function(i) {
  data.frame(y = exp(beta * seq(n)) + rnorm(n),
             time = seq(n),
             model = combos$model[i],
             category = combos$category[i])
}))
df$model <- as.factor(df$model)
df$category <- as.factor(df$category)
df$time <- as.numeric(df$time)
df$y <- as.numeric(df$y)
range01 <- function(x){(x - min(x)) / (max(x) - min(x))}  # rescale y to [0, 1]
df$y <- range01(df$y)
df$y[df$y == 1] <- 0.999999  # betar needs y strictly inside (0, 1)
df$y[df$y == 0] <- 0.000001
m1 <- gamm(y~s(time, by=category, k=-1) + category, random=list(model=~1), family=betar, data=df)
m2 <- gam(y~s(time, by=category, k=-1) + category + s(model, bs='re'), family=betar, data=df)