
I am estimating a very simple model on a large dataset. The formula looks like

 scam::scam(formula = ratio ~ s(rate,bs="mpi"))

These models are then used to generate predictions for new data. I do not need anything else from the model.

My problem is that the returned object is huge (a few GB), which tends to lead to problems downstream.

I believe this is because scam and gam store the fitted values for each of the millions of records.

Is there a way to only save a small object containing the minimum required to predict on new data? This should not be bigger than a few kilobytes.
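(For reference, a rough way to check which components of the fitted object dominate its size; this assumes the fit is stored in an object called model:)

sizes <- sort(sapply(model, object.size), decreasing = TRUE)
head(sizes / 1024^2, 10)  # largest components, in MB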

huge thanks!

Edit 1: here is a reproducible example to show my understanding of Gavin's answer:

library(mgcv)
data(iris)
library(tidyverse)
mydb <- iris %>% filter(Species == "setosa")

dim(mydb) # 50 records
model <-  mgcv::gam(formula = Sepal.Length ~ s(Sepal.Width,bs="cs"), 
                     data  = mydb)

print(object.size(model), units = "KB") # 78 KB

distinct_mydb <- mydb %>% distinct(Sepal.Width) # 16 distinct values of the independent variable
Xp <- predict(model, newdata = distinct_mydb, type = "lpmatrix")
coefs <- coef(model)
dim(Xp) # 16 rows and 10 columns (intercept plus 9 spline basis functions)
preds1 <- Xp %*% coefs %>% t()
preds2 <- predict(model, newdata = distinct_mydb)  # preds1 and preds2 are identical

print(object.size(Xp), units = "KB")    # 3.4 KB
print(object.size(coefs), units = "KB") # 1.1 KB

In this solution, I would save "Xp" (3.4 KB) and "coefs" (1.1 KB), for a total of 4.5 KB, instead of saving "model", which takes up 78 KB.

What I am unsure about is how I could use Xp and coefs next week to predict the Sepal.Length of a flower with a never-before-seen Sepal.Width of 2.5?

Edit 2: Is the answer simply to generate a grid of all possible Sepal.Width values (rounded to some decimal) and just left_join this table with any future data?

fake_db <- data.frame(Sepal.Width = seq(0, max(mydb$Sepal.Width), by = 0.1))
fake_db$predicted_Sepal.Length <- predict(model, newdata = fake_db)
print(object.size(fake_db), units = "KB") # 4.3 KB
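(A minimal sketch of that lookup idea, with new_flowers as a made-up batch of future data; joining on a formatted character key sidesteps floating-point mismatches between the grid and the rounded new values:)

new_flowers <- data.frame(Sepal.Width = c(2.5, 3.14))  # hypothetical future data

lookup <- fake_db %>% mutate(key = sprintf("%.1f", Sepal.Width))

new_flowers %>%
  mutate(key = sprintf("%.1f", Sepal.Width)) %>%   # round to the grid's resolution
  left_join(lookup %>% select(key, predicted_Sepal.Length), by = "key") %>%
  select(-key)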
Zoltan

1 Answer


Look at ?mgcv::predict.gam and the information for the type argument, in particular "lpmatrix".

For example, you only need the coefficient vector and the output from

predict(model, newdata, type = "lpmatrix")

where newdata is a much smaller subset of your original data that still covers the ranges of the covariates.

This option "lpmatrix" is designed for use downstream or outside of R. The general idea is that given "lpmatrix" as Xp then Xp %*% coef(model) gives fitted values. But as you can reduce the size of Xp via newdata you can reduce the dimensionality of the object needed for prediction.

Gavin Simpson
  • I've read a lot of your responses to this general question and I think I'm still missing something. Is this correct? `xp <- predict(original_model, newdata = original_data, type = 'lpmatrix')` `coefs <- coef(original_model)` `fitted_values <- xp %*% coefs %>% t()` Using this pattern, how would I introduce a new batch of data to be scored? Do I generate a new lpmatrix with every new batch, or is there an lpmatrix associated with the original training data that can be stored and applied to each subsequent batch of data? – Raphael K Feb 16 '20 at 15:17
  • Wait, I think I get it. For each new batch of data you generate `Xp` using `predict()`, and then multiply that by `coefs` to get the scores. – Raphael K Feb 18 '20 at 15:31
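(Summarising the workflow from the comments as a sketch; score_batch is just an illustrative name, and the fitted model object is still needed to build the lpmatrix for each new batch:)

score_batch <- function(model, coefs, new_batch) {
  Xp <- predict(model, newdata = new_batch, type = "lpmatrix")
  drop(Xp %*% coefs)
}

score_batch(model, coef(model), data.frame(Sepal.Width = c(2.5, 3.0)))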