Way to extract data from lm-object before function is applied?

Question

Let me dive straight into an example to show my problem:

 rm(list=ls())
 n <- 100
 df <- data.frame(y=rnorm(n), x1=rnorm(n), x2=rnorm(n) )
 fm <- lm(y ~ x1 + poly(x2, 2), data=df)

Now, I would like to have a look at the previously used data. This is almost available by using

 temp.data <- fm$model

However, x2 will have been split up into poly(x2,2), which will itself be a dataframe as it contains a value for x2 and x2^2. Note that it may seem as if x2 is contained here, but since the polynomial uses orthogonal components, temp.data$x2 is not the same as df$x2. This can also be seen if you compare the variables visually after, say, the following: new.dat <- cbind(df, fm$model).

Now, to some questions:

First, and most importantly, is there a way to retrieve x2 from the lm-object in its original form? Or more generally, if some function f has been applied to some variable in the lm-formula, can the underlying variables be extracted from the lm-object (without doing case-specific math)? Note that I know I could retrieve the data by other means, but I wonder if I can extract it from the lm-object itself.

Second, on a more general note, since I did explicitly not ask for model.matrix(fm), why do I get data that has been manipulated? What is the underlying philosophy behind that? Does anyone know?

Third, the command head(new.dat) shows me that x2 has been split up into two components. What I see when I type View(new.dat) is, however, only one column. This strikes me as puzzling. How can two columns be represented as one, and why is there a difference between head and View? If anyone can explain, I would be highly indebted!

If these questions are too basic, I apologize. In that case, I would appreciate any pointers to relevant manuals where this is explained.

Thanks in advance!

Ben Bolker · Accepted Answer · 2015-07-12T19:04:18.500

Good question, but this is difficult. fm$model is a weird data frame, of a type that would be hard for a user to construct, but which R sometimes generates internally. Check out the first few lines of str(fm$model), which show you that it's a data frame whose third component is an object of class poly with dimensions (100,2) -- i.e. something like a matrix:

## 'data.frame':    100 obs. of  3 variables:
##  $ y          : num  -0.5952 -1.9561 1.8467 -0.2782 -0.0278 ...
##  $ x1         : num  0.423 -1.539 -0.694 0.254 -0.13 ...
##  $ poly(x2, 2): poly [1:100, 1:2] 0.0606 -0.0872 0.0799 -0.1068 -0.0395 ...

If you're still working in the environment from which lm was called in the first place, and if lm was called using the data argument, you can use eval(getCall(fm)$data) to get the original data. If things are being passed in and out of functions, or if someone used lm on independent objects in the environment, you're probably out of luck. If you get in trouble you can try

eval(getCall(fm)$data, environment(formula(fm)))

but things rapidly start getting harder.
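As a rough sketch of the approach above, assuming the data frame passed to lm still exists under the same name in the formula's environment (the names orig_data and raw_vars are only illustrative and not part of the original answer):

```r
set.seed(1)
n <- 100
df <- data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n))
fm <- lm(y ~ x1 + poly(x2, 2), data = df)

# Re-evaluate the 'data' argument of the original call in the
# environment attached to the model formula.
orig_data <- eval(getCall(fm)$data, environment(formula(fm)))

# Keep only the untransformed variables that the formula refers to.
raw_vars <- get_all_vars(formula(fm), data = orig_data)

names(raw_vars)   # the variables as they were before poly() was applied
```

This does not pull anything out of the lm object itself; it only finds the original data again, so it fails if that data has been removed or changed since the fit.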

I don't fully understand the logic of storing the processed model frame rather than the raw data, but I think it has to do with the construction of the terms object for the linear model -- each element in the stored model frame corresponds to an element of the terms object. I don't really understand the distinction between factors -- which are post-processed by model.matrix into sets of columns of dummy variables -- and transformed data (e.g. log(x)) or special objects like polynomial or spline bases ...

lebatsnok · Answer 2 · 2014-04-07T21:57:59.263

The question is, how badly you need it. If you look at the structure of fm$model$poly then at the end you will see something like this:

attr(,"coefs")
attr(,"coefs")$alpha
[1] 0.06738858 0.10887048

attr(,"coefs")$norm2
[1]   1.00000 100.00000  93.96666 155.01387

I suppose these coefficients could be used to restore your original data from poly. See the source code for poly function (either page(poly) or just type poly in the console) ... it looks like computing the polynomials might be reversible. But why bother doing it? I can think of two reasons: (1) you have lost the original data and the only way to restore it is this; (2) you want to understand how R computes orthogonal polynomials.
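For what it's worth, here is a minimal sketch of that reversal for the orthogonal case. It assumes the layout of the "coefs" attribute that poly currently uses (the first-degree column is the centred x divided by sqrt(norm2[3]), with alpha[1] equal to the mean), so it is exactly the kind of case-specific math the question hoped to avoid. The name x2_restored is illustrative.

```r
set.seed(1)
n <- 100
df <- data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n))
fm <- lm(y ~ x1 + poly(x2, 2), data = df)

# The stored basis and the coefficients poly() attached to it.
p <- fm$model[["poly(x2, 2)"]]
coefs <- attr(p, "coefs")

# Undo the scaling and centring of the first-degree column.
x2_restored <- as.numeric(p[, 1]) * sqrt(coefs$norm2[3]) + coefs$alpha[1]

all.equal(x2_restored, df$x2)
```

This would not apply to poly(x2, 2, raw = TRUE), where the first column is already x2, nor to other transformations such as splines.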

"Second, on a more general note, since I did explicitly not ask for model.matrix(fm), why do I get data that has been manipulated? What is the underlying philosophy behind that? Does anyone know?"

Do you mean, why is data saved with the lm object at all? Just in case, I suppose. You can easily switch it off:

fm <- lm(y ~ x1 + poly(x2, 2), data=df, model=FALSE)
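A small sketch of what switching it off does (fm_small and mf are illustrative names). Note that model.frame() can still rebuild the frame afterwards, but only as long as df is still available, and the rebuilt frame again holds the poly basis rather than the raw x2:

```r
set.seed(1)
n <- 100
df <- data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n))

fm_small <- lm(y ~ x1 + poly(x2, 2), data = df, model = FALSE)

# No model frame is stored in the fitted object.
is.null(fm_small[["model"]])

# The frame is rebuilt by re-evaluating the original call.
mf <- model.frame(fm_small)
class(mf[["poly(x2, 2)"]])
```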

Or why are the data "manipulated"? I.e., why is poly(x2,2) saved with the data instead of the original x2? My understanding is that you requested this yourself. The poly(x2,2) part is first evaluated and then passed to lm, so that lm doesn't even have the original x2.

edit - to answer the comment below in a more convenient way

"For instance, using factor(f) for some additional factor variable does not get translated into a data frame being stored in fm$model. Only the actual variable f is being stored in fm$model, whereas in this case with poly, some transformation is stored. This puzzles me."

I think you've missed something here and the behaviour is the same for both poly and factor.

> df <- data.frame(a=1:5, b=2:6, c=rnorm(5))
> fm <- lm(c~ a + factor(b), df)
> fm$model
           c a factor(b)
1  0.5397541 1         2
2  0.9108087 2         3
3  0.1819442 3         4
4 -0.9293893 4         5
5  0.1404305 5         6
> fm$model$factor
[1] 2 3 4 5 6
Levels: 2 3 4 5 6
Warning message:
In `$.data.frame`(fm$model, factor) : Name partially matched in data frame

You can see that fm$model has factor(b) instead of b, and fm$model$factor is indeed a factor, not the original integer variable. (The warning appears because the name is actually factor(b); I used factor to avoid typing something as ugly as fm$model$'factor(b)', with the single quotes replaced by backquotes.)

The motivation is that I am trying to program a small function that takes an lm-object as argument and does some computations. It seems more memory-efficient and, quite frankly, more elegant to use the data already stored in the lm-object, which is why I tried this. R's behavior here puzzles me though. For instance, using ``factor(f)`` for some additional factor variable does not get translated into a data frame being stored in ``fm$model``. Only the actual variable ``f`` is being stored in ``fm$model``, whereas in this case with ``poly``, some transformation is stored. This puzzles me. — coffeinjunky, Apr 07 '14 at 21:42
Sorry for the late response. I don't get notified when someone edits his or her answer, so I missed it... Anyway, point taken! I just saw that the original data was there (in form of the levels of the factor), and I did indeed not notice that the data structure itself has changed. So, whenever a function is applied to a variable in a regression, `fm$model` stores the data after application of the function, so the original data is difficult to retrieve in an automatic fashion, I guess... Anyway, thanks for your answer! That was helpful indeed. — coffeinjunky, Apr 09 '14 at 16:04
The remaining bit still puzzling is that `head(new.dat)` and `View(new.dat)` show different things. If you have any idea what might be going on there... — coffeinjunky, Apr 09 '14 at 16:08
look at `str(new.dat)` -- you will see that this is not a simple data frame because one of its "columns" is actually a 2-dimensional object (class `poly` but actually a matrix with some extra attributes). It appears that RStudio is just not good at showing such data, head() is right and View() is wrong. Compare `df1 <- data.frame(a=1:5, b=matrix(1:10, ncol=2))` with `df2 <- data.frame(a=1:5, b=I(matrix(1:10, ncol=2)))` with `str`, `View`, and `head` -- View(df2) is particularly strange and misleading. — lebatsnok, Apr 09 '14 at 20:10
Ah, so it is an RStudio-thing. Thanks, that I did not imagine. I thought some internal R things is going on... — coffeinjunky, Apr 10 '14 at 13:58
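
The comparison suggested in the comments can be run as below. This is only a sketch of that suggestion. It shows that a plain matrix is spread into separate columns, while a matrix wrapped in I() stays a single data frame column with two dimensions, which is the same situation as the poly column in fm$model:

```r
df1 <- data.frame(a = 1:5, b = matrix(1:10, ncol = 2))
df2 <- data.frame(a = 1:5, b = I(matrix(1:10, ncol = 2)))

ncol(df1)    # the matrix was split into separate columns
ncol(df2)    # the matrix is kept as one column
dim(df2$b)   # that single column is itself two-dimensional

str(df2)
head(df2)
```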
