I am estimating a very simple model on a large dataset. The call looks like this (mydata stands in for my full data frame):
scam::scam(formula = ratio ~ s(rate, bs = "mpi"), data = mydata)
These models are then used to generate predictions on new data; I do not need anything else from the fitted object.
My problem is that the returned object is huge (a few GB), which tends to lead to problems downstream.
I believe this is because scam and gam store the fitted values for each of the millions of records.
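For what it's worth, this is roughly how I checked which components dominate the size (a sketch; mymodel is a placeholder for the fitted scam object from above):
# Rough check, in bytes, of which components of the fitted object are biggest;
# mymodel is a placeholder for a fitted scam/gam object
sort(sapply(mymodel, function(x) as.numeric(object.size(x))), decreasing = TRUE)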
Is there a way to save only a small object containing the minimum required to predict on new data? It should be no bigger than a few kilobytes.
Huge thanks!
Edit 1: here is a reproducible example to show my understanding of Gavin's answer:
library(mgcv)
data(iris)
library(tidyverse)
mydb <- iris %>% filter(Species == "setosa")
dim(mydb) # 50 records
model <- mgcv::gam(formula = Sepal.Length ~ s(Sepal.Width,bs="cs"),
data = mydb)
print(object.size(model), units = "KB") # 78 KB
distinct_mydb <- mydb %>% distinct(Sepal.Width) # 16 distinct values of the independent variable
Xp <- predict(model, newdata = distinct_mydb, type = "lpmatrix")
coefs <- coef(model)
dim(Xp) # 16 rows and 10 columns (one per coefficient: the intercept plus 9 spline basis functions)
preds1 <- drop(Xp %*% coefs) # manual prediction: lpmatrix times coefficients
preds2 <- predict(model, newdata = distinct_mydb) # preds1 and preds2 are identical
print(object.size(Xp), units = "KB") # 3.4 KB
print(object.size(coefs), units = "KB") # 1.1 KB
In this solution, I would save "Xp" (3.4 KB) and "coefs" (1.1 KB), for a total of 4.5 KB, instead of saving "model", which takes up 78 KB.
What I am unsure about is how I could use Xp and coefs next week to predict the Sepal.Length of a flower with a never-seen-before Sepal.Width of, say, 2.5.
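To make the sticking point concrete, this is what I would like to run next week (new_flower is made up), but building the lpmatrix for the new value seems to require the full model object anyway:
new_flower <- data.frame(Sepal.Width = 2.5)
# This matches predict(model, newdata = new_flower), but constructing
# Xp_new still needs the full model object, not just the saved coefs:
Xp_new <- predict(model, newdata = new_flower, type = "lpmatrix")
pred_new <- drop(Xp_new %*% coefs)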
Edit 2: Is the answer simply to generate a grid of all possible Sepal.Width values (rounded to some decimal place) and just left_join that table onto any future data?
fake_db <- data.frame(Sepal.Width = seq(0, max(mydb$Sepal.Width), by = 0.1))
fake_db$predicted_Sepal.Length <- predict(model, newdata = fake_db)
print(object.size(fake_db), units = "KB") # 4.3 KB