
lm sets model = TRUE by default, meaning the model frame (the data used for fitting) is copied and returned with the fitted object. It is used by predict but creates memory overhead (example below).

I am wondering, is the copied dataset used for any reason other than predict?

Not essential to answer, but I'd also like to know of models that store data for reasons other than predict.

Example

object.size(lm(mpg ~ ., mtcars))
#> 45768 bytes
object.size(lm(mpg ~ ., mtcars, model = FALSE))
#> 28152 bytes

Bigger dataset = bigger overhead.
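The size gap is essentially the copy itself, which lives in the fitted object's $model component; a minimal sketch to verify where it is stored:

```r
# The copied data is the model frame, held in $model as a data frame
# containing every variable used in the fit.
FIT <- lm(mpg ~ ., mtcars)
class(FIT$model)        # "data.frame"
object.size(FIT$model)  # roughly accounts for the size gap above

# With model = FALSE the component is simply absent:
is.null(lm(mpg ~ ., mtcars, model = FALSE)$model)  # TRUE
```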

Motivation

To share my motivation, the twidlr package forces users to provide data when using predict. If this makes copying the dataset when learning unnecessary, it seems reasonable to save memory by defaulting to model = FALSE. I've opened a relevant issue here.

A secondary motivation - you can easily fit many models like lm with pipelearner, but copying data each time creates massive overhead. So finding ways to cut down memory needs would be very handy!

Simon Jackson
  • Not sure if this is relevant, but the data.frame can sometimes be helpful in determining characteristics of the data that the model was created from. For example, if you fit a model like `lm(mpg ~ drat^2, data=mtcars)`, the data.frame will give you the actual values of `mpg` and `drat`; otherwise you wouldn't know anything about the distribution of `drat`, or whether it might take on positive or negative values, because of the transformation that squares the data. – Steven M. Mortimer Jun 24 '17 at 01:38
  • If I know the distribution from the data.frame, I have a better idea of whether I'm interpolating or extrapolating from the model when making new predictions. – Steven M. Mortimer Jun 24 '17 at 01:39
  • Thanks @StevenMortimer. Although it's not exactly what I'm after, I agree it's important to know properties of the data. However, I tend to do this prior to modeling rather than fitting the model then examining the copied data. – Simon Jackson Jun 24 '17 at 03:49
  • Yep, true, but sometimes you'll end up with somebody else's model and the original data is lost or not accessible for your examination. – Steven M. Mortimer Jun 24 '17 at 05:18
  • To me, it's not frequent enough to warrant storing the data (especially when you may be fitting 1000s of models &/or based on large data sets), but I see your point. I'd hope that people document fitted models properly if passing them around like that. Another reason for my question is that many models other than `lm` point to, rather than copy the data. eg, see https://gist.github.com/drsimonj/5b2cfc428fce350676db5dc77c059052 – Simon Jackson Jun 24 '17 at 05:48

1 Answer


I think the model frame is returned as protection against non-standard evaluation.

Let's look at a small example.

dat <- data.frame(x = runif(10), y = rnorm(10))
FIT <- lm(y ~ x, data = dat)
fit <- FIT; fit$model <- NULL

What is the difference between the following two calls?

model.frame(FIT)
model.frame(fit)

Checking methods(model.frame) and stats:::model.frame.lm shows that in the first case the model frame is efficiently extracted from FIT$model, while in the second it is reconstructed from fit$call by model.frame.default. Such a difference also results in a difference between

# depends on `model.frame`
model.matrix(FIT)
model.matrix(fit)

as the model matrix is built from a model frame. If we dig further, the same split shows up here, too:

# depends on `model.matrix`
predict(FIT)
predict(fit)

# depends on `predict.lm`
plot(FIT)
plot(fit)
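Up to this point the two objects behave the same: while dat is still available, both routes recover an equivalent model frame. A small self-contained check (re-creating the objects here so the snippet runs on its own):

```r
dat <- data.frame(x = runif(10), y = rnorm(10))
FIT <- lm(y ~ x, data = dat)   # keeps the model frame
fit <- FIT; fit$model <- NULL  # drops it

mfa <- model.frame(FIT)  # extracted directly from FIT$model
mfb <- model.frame(fit)  # rebuilt from fit$call; needs `dat` to exist

identical(dim(mfa), dim(mfb))  # TRUE
all.equal(mfa$y, mfb$y)        # TRUE
```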

Note that this is where the problem lies. If we deliberately remove dat, we cannot reconstruct the model frame, and all of these will fail:

rm(dat)
model.frame(fit)
model.matrix(fit)
predict(fit)
plot(fit)

while all of them still work with FIT.


Even that is not the worst of it. The following example, involving non-standard evaluation, is really bad!

fitting <- function (myformula, mydata, keep.mf = FALSE) {
  b <- lm(formula = myformula, data = mydata, model = keep.mf)
  par(mfrow = c(2, 2))
  plot(b)
  predict(b)
}

Now let's create the data frame again (we removed it earlier):

dat <- data.frame(x = runif(10), y = rnorm(10))

Can you see that

fitting(y ~ x, dat, keep.mf = TRUE)

works but

fitting(y ~ x, dat, keep.mf = FALSE)

fails?
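Why? A hypothetical helper (not from the original post, just for inspection) that returns the fitted object shows that the stored call refers to the function's local names, which no longer exist when model.frame() later tries to re-evaluate it:

```r
dat <- data.frame(x = runif(10), y = rnorm(10))

# Hypothetical variant of `fitting` that returns the object so we can
# look at the call recorded inside it:
fit_only <- function (myformula, mydata) {
  lm(formula = myformula, data = mydata, model = FALSE)
}
b <- fit_only(y ~ x, dat)
b$call
#> lm(formula = myformula, data = mydata, model = FALSE)
```

When plot or predict later needs the model frame, model.frame.default must evaluate mydata, which only existed inside the function; keeping the model frame sidesteps that re-evaluation entirely.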

Here is a question I answered / investigated a year ago: R - model.frame() and non-standard evaluation. It was asked about the survival package. That example is really extreme: even if we provide newdata, we would still get an error. Retaining the model frame is the only way to proceed!


Finally, on your observation of memory costs: in fact, $model is not the main contributor to a potentially large lm object; $qr is, as it has the same dimensions as the model matrix. Consider a model with many factor variables: each factor is a single compact column in the model frame but expands into a block of dummy columns in the model matrix, so the model frame is much smaller than the model matrix. Omitting the model frame therefore does little to reduce lm object size. This is actually one motivation behind the development of biglm.
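A sketch of that size asymmetry, using made-up factor-heavy data (the numbers here are illustrative assumptions, not from the original post):

```r
# A single 26-level factor is one compact column in the model frame,
# but expands to 26 model-matrix columns (intercept + 25 dummies),
# and $qr has the same dimensions as the model matrix.
set.seed(1)
d <- data.frame(y = rnorm(1000),
                g = factor(sample(letters, 1000, replace = TRUE)))
f <- lm(y ~ g, data = d)

object.size(f$model)  # small: y plus one factor column
object.size(f$qr$qr)  # ~ the 1000 x 26 model matrix, far larger
```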


Since I have mentioned biglm anyway, I would emphasize again that this method only helps reduce the final model object size, not RAM usage during model fitting.

Zheyuan Li
  • Great answer! Thanks. From your answer, I see `plot` being a problem. Also, I had noticed qr taking up memory, but I think it's necessary to make predictions (whereas model frame is not)? – Simon Jackson Jun 25 '17 at 00:07