1

I can generate predictions for a fitted linear model by assigning the output of lm() to a name, like fit_lm, and then use predict() with that fit to generate predictions on newdata (see reprex below).

With big regressions, lm() objects can become large, as they carry with them the original data with which they were fit, as well as some other potentially large pieces of data. When I am doing this in an automated fashion over many datasets, the individual lm objects can take up a lot of space, and I don't want to carry around the whole lm object. I would like to extract a predictive function from the fit that I can store and use for prediction. Is there an easy way to just extract/construct from the fit a function that does prediction? At the very bottom of my reprex in comments is an example of how I envision the code working.

# Do a lm fit
set.seed(1234)
df <- data.frame(x = 1:9, y = 2 * 1:9 + 3 + rnorm(9, sd = 0.5))
fit <- lm(y ~ x, df)
summary(fit)
#> 
#> Call:
#> lm(formula = y ~ x, data = df)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.0125 -0.1178 -0.1007  0.3780  0.6995 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   2.8519     0.4035   7.068 0.000199 ***
#> x             1.9969     0.0717  27.851 1.98e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.5554 on 7 degrees of freedom
#> Multiple R-squared:  0.9911, Adjusted R-squared:  0.9898 
#> F-statistic: 775.7 on 1 and 7 DF,  p-value: 1.976e-08

# Predict it
predict(fit, data.frame(x = 5:6))
#>        1        2 
#> 12.83658 14.83351

# Like to see that I could extract the fit as a function that could be used:
#
# f <- regressionFunction(fit)
# vector_of_fits <- f(data.frame(x = 5:6))
#
# vector_of_fits would equal: 
#>        1        2 
#> 12.83658 14.83351

Created on 2020-01-07 by the reprex package (v0.3.0)

mpettis
  • 3,222
  • 4
  • 28
  • 35

2 Answers2

6

First, we borrow a function from this other question that reduces the size of the lm object.

clean_model = function(cm) {
  # just in case we forgot to set
  # y=FALSE and model=FALSE
  cm$y = c()
  cm$model = c()

  cm$residuals = c()
  cm$fitted.values = c()
  cm$effects = c()
  cm$qr$qr = c()
  cm$linear.predictors = c()
  cm$weights = c()
  cm$prior.weights = c()
  cm$data = c()

  # also try and avoid some large environments
  attr(cm$terms,".Environment") = c()
  attr(cm$formula,".Environment") = c()

  cm
}

Then write a simple wrapper that reduces the model and returns the prediction function:

prediction_function <- function(model) {
  stopifnot(inherits(model, 'lm'))
  model <- clean_model(model)
  function (...) predict(model, ...)
}

Example:

set.seed(1234)
df <- data.frame(x = 1:9, y = 2 * 1:9 + 3 + rnorm(9, sd = 0.5))
fit <- lm(y ~ x, df)
f <- prediction_function(fit)
f(data.frame(x = 5:6))
       1        2 
12.83658 14.83351 

Check sizes:

object.size(fit)
# 16648 bytes

object.size(prediction_function)
# 8608 bytes

For this small example we save half the space.

Let's use some larger data:

data(diamonds, package = 'ggplot2')

fit2 <- lm(carat ~ price, diamonds)
predict(fit2, data.frame(price = 200))
f2 <- prediction_function(fit2)
f2(data.frame(price = 200))

print(object.size(fit2), units = 'Mb'); 
object.size(f2)

Now we go from 13 Mb to 5376 bytes.

Axeman
  • 32,068
  • 8
  • 81
  • 94
  • Just what I was looking for, thanks! I looked for some lm-object cleaning posts, didn't see this one... – mpettis Jan 08 '20 at 01:24
  • This seems, in my experience, to be so common that it would find it's way into the stats package, or a common CRAN package... – mpettis Jan 08 '20 at 01:26
  • 1
    Some alternate packages my provide linear model functions with smaller objects, such as `RcppArmadillo` or `biglm` (that's big data, not big object). – Axeman Jan 08 '20 at 19:16
  • I did look at biglm, but I got a little skittish when I further googled and saw different results of using biglm and just lm. I didn't dig deeper on the differences, thought I'd go this route... example: https://stats.stackexchange.com/questions/255520/why-do-lm-and-biglm-in-r-give-different-p-values-for-the-same-data – mpettis Jan 09 '20 at 16:11
1

this is an answer using the helpful broom package to tidy model output.

library(broom)
set.seed(1234)
df <- data.frame(x = 1:9, y = 2 * 1:9 + 3 + rnorm(9, sd = 0.5))
fit <- lm(y ~ x, df)
summary(fit)
#> 
#> Call:
#> lm(formula = y ~ x, data = df)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.0125 -0.1178 -0.1007  0.3780  0.6995 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   2.8519     0.4035   7.068 0.000199 ***
#> x             1.9969     0.0717  27.851 1.98e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.5554 on 7 degrees of freedom
#> Multiple R-squared:  0.9911, Adjusted R-squared:  0.9898 
#> F-statistic: 775.7 on 1 and 7 DF,  p-value: 1.976e-08
predict(fit, data.frame(x = 5:6))
#>        1        2 
#> 12.83658 14.83351

# store model coef in data frame using broom
model_params <- tidy(fit)
model_params
#> # A tibble: 2 x 5
#>   term        estimate std.error statistic      p.value
#>   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
#> 1 (Intercept)     2.85    0.403       7.07 0.000199    
#> 2 x               2.00    0.0717     27.9  0.0000000198

# create function to predict from model params
predict_from_params <- function(x, model_params){
  model_params[1,]$estimate + x * model_params[2,]$estimate
  }

predict_from_params(df$x, model_params)
#> [1]  4.848859  6.845790  8.842720 10.839651 12.836581 14.833512 16.830442
#> [8] 18.827373 20.824303
Greg B
  • 86
  • 4
  • Thank you for this. This works exactly with my particular example of a one-factor regression. I'm accepting the other one because it generalizes to more than that. Thanks again! – mpettis Jan 08 '20 at 01:25