Can the subset() function within the lm() R function be used to remove observations only of certain variables?

Question

I am not sure my question makes sense, but I am considering modifying an econometrics model that uses time series data. It is a multiple regression. One of the independent variables is the 5-year Treasury rate, and it is split over two time periods. The first variable is the 5-year Treasury rate from 1950 to 1986; after 1986 it takes the value 0. The second is the 5-year Treasury rate from 1986 to the present; before 1986 it takes the value 0. Someone suggested that I replace the 0 values with blanks (equivalent to missing data), on the grounds that the meaning of those variables would supposedly be better specified. Could you do that with the subset() function? In other words, could you in effect remove or ignore the 0 values in those variables without removing or ignoring the entire row of data, and so without removing the values of the other independent variables? I know this coding question is contingent on whether the process even makes sense, and I am not sure it does. I have put the theoretical question to Cross Validated, but I am not sure I will get an answer, so I figured I would go ahead and ask the coding question here.

Is the point of this that you want to treat 1950-1986 and 1986-present as different periods? If that is the case you can create a dummy categorical variable that has two values, for instance "pre" (for all rows prior to 1986) and "post" (for all rows after 1986), and then just include that in your regression. — jlhoward, Sep 13 '15 at 04:05
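The dummy-variable suggestion in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: the data frame `df`, the columns `year`, `treasury5`, and `y`, and the simulated values are all hypothetical, and the interaction term is one possible reading of "include that in your regression" (it lets the Treasury-rate slope differ by period, which is what the two split variables in the question appear to be aiming at).

```r
# Hypothetical annual data; all names and values are made up for illustration.
set.seed(1)
df <- data.frame(year = 1950:2015)
df$treasury5 <- runif(nrow(df), min = 1, max = 15)
df$y <- 2 + 0.5 * df$treasury5 + rnorm(nrow(df))

# Two-level period indicator. Labelling 1986 itself as "post" is an assumption.
df$period <- factor(ifelse(df$year < 1986, "pre", "post"),
                    levels = c("pre", "post"))

# The interaction gives each period its own intercept and slope,
# while every row remains a complete observation.
fit <- lm(y ~ treasury5 * period, data = df)
coef(fit)
```

With this coding, no zeros need to be blanked out: the `treasury5:periodpost` coefficient is the difference in slope between the two periods.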

score 2 · Accepted Answer · answered Sep 13 '15 at 00:43

Assuming your data is in a data frame, the answer is "no." You cannot use subset on only part of a data.frame. That's because subset on a data frame returns another data frame, and in a data frame all of the variables must be the same length.
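A small sketch of the row-wise behaviour described above, using a toy data frame whose column names (`rate_pre`, `rate_post`) and values are hypothetical. It shows that both subset() and the subset argument of lm() drop whole rows, not individual cells:

```r
# Toy data frame; names and values are hypothetical.
df <- data.frame(
  y         = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  rate_pre  = c(3, 4, 5, 0, 0, 0),
  rate_post = c(0, 0, 0, 6, 7, 8)
)

# subset() filters rows: excluding the zeros in rate_pre
# removes those rows for every column, including y and rate_post.
kept <- subset(df, rate_pre != 0)
dim(kept)

# The subset argument of lm() works the same way: entire rows are excluded.
fit <- lm(y ~ rate_pre, data = df, subset = rate_pre != 0)
nobs(fit)
```

Here the rows where `rate_pre` is 0 vanish entirely, so the nonzero `rate_post` values go with them.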

There are plenty of ways to work around this restriction, but they won't work with lm. Think about how regression works: every observation must be fully observed. If you have missing data, you have three options:

1. Delete the observations with missing data. This is called listwise deletion, and it is the default in lm (by way of the na.omit function, which is applied when lm builds its model frame with model.frame).
2. Impute the missing data. This is a massive field and an area of active research.
3. Use some other kind of method, like a Bayesian model that can integrate over the missing data.
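To make the first option concrete, here is a minimal sketch of what listwise deletion looks like in practice. The data frame and its columns (`y`, `x1`, `x2`) are hypothetical, and `na.action = na.omit` is passed explicitly only to make the usual default visible:

```r
# Toy data with one missing value in each predictor; values are hypothetical.
df <- data.frame(
  y  = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8, 7.1, 8.0),
  x1 = c(1, 2, NA, 4, 5, 6, 7, 8),
  x2 = c(2, 1, 4, 3, NA, 5, 8, 6)
)

# Any row with an NA in a model variable is dropped as a whole.
fit <- lm(y ~ x1 + x2, data = df, na.action = na.omit)

nobs(fit)                    # number of rows actually used in the fit
as.integer(fit$na.action)    # positions of the rows that were dropped
```

So replacing the zeros with NA would not keep the rest of each row in the model; it would remove those rows from the regression altogether.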

You should be able to get help in this area from Cross Validated. But the fact remains, there is simply no way to use lm on variables of unequal length, and there is no way to get subset to return a data frame containing variables of unequal length because all variables in a data frame must be the same length.

answered Sep 13 '15 at 00:43

shadowtalker


Thanks, this was very helpful. – Sympa Sep 13 '15 at 05:12
ssdecontrol, I assume the same would be true in SAS with PROC REG, or whatever it is called. In other words, you can't run a multiple regression on variables with different numbers of observations. Given that, no software is going to handle what is infeasible. I think this is the case because multiple regressions are solved using matrix algebra, which requires this fundamental property [that each variable has the same number of observations]. – Sympa Sep 13 '15 at 15:27
@GaetanLion correct. It's not even because of the matrix algebra. The problem exists even in the simplest undergrad econometrics case of explicitly optimizing `y = a + b1x1 + b2x2` by taking the derivative. The regression problem just isn't defined for unequal variable lengths. – shadowtalker Sep 13 '15 at 18:18
That was a very helpful comment. Thanks, I gave you an upvote and best answer for your contribution. – Sympa Sep 13 '15 at 20:24
