I'm trying to run lm() on only a subset of my data, and running into an issue.

library(data.table) # provides data.table() and the dt[...] subsetting used below

dt = data.table(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100), x3 = as.factor(c(rep('men',50), rep('women',50)))) # sample data

lm( y ~ ., dt) # Use all x: Works
lm( y ~ ., dt[x3 == 'men']) # Use all x, limit to men: doesn't work (as expected)

The second call fails because the subset now contains only men, so x3, the gender variable, has a single level and can't be included in the model. BUT...

lm( y ~ . -x3, dt[x3 == 'men']) # Exclude x3, limit to men: STILL doesn't work
lm( y ~ x1 + x2, dt[x3 == 'men']) # Exclude x3, with different notation: works great

Is this an issue with the "minus sign" notation in the formula? Please advise. Note: Of course I can do it a different way; for example, I could drop the variables from the data before passing it to lm(). But I'm teaching a class on this stuff, and I don't want to confuse the students, having already told them they can exclude a variable using a minus sign in the formula.

MrFlick
Zhaochen He
  • 3
    It's interesting that both `model.matrix(y ~ . - x3, data = dt[x3 == "men"])` and `model.matrix(y ~ x1 + x2, data = dt[x3 == "men"])` work (`lm` calls `model.matrix` internally). The only difference between both model matrices is a `"contrasts"` attribute (which still contains `x3`) and which gets picked up later on within the `lm` routine, likely causing the error you're seeing. So my feeling is that the issue has to do with how `model.matrix` creates and stores the design matrix when removing terms. – Maurits Evers Feb 12 '20 at 23:33
  • I was trying to "expand" the `.` to get a simplified formula with `terms(y ~ . -x3, data=dt, simplify=TRUE)` but oddly it still retains `x3` in the variables attribute which trips up `lm` – MrFlick Feb 12 '20 at 23:36
  • 1
    @MrFlick - it looks like the unimplemented-in-R `neg.out=` option might be related. From the S help files for `terms`, where `neg.out=` is implemented: *flag controlling the treatment of terms entering with "-" sign. If TRUE, terms will be checked for cancellation and otherwise ignored. If FALSE, negative terms will be retained (with negative order).* – thelatemail Feb 12 '20 at 23:50
  • 1
    @MauritsEvers: `lm` calls `model.matrix` on a modified version of the data. At the very beginning, `lm` composes and evaluates the following expression: `mf <- stats::model.frame( y ~ . -x3, dt[x3=="men"], drop.unused.levels=TRUE )`. This causes `x3` to become a single-level factor. `model.matrix()` is then called on `mf`, not the original data, resulting in the error we're observing. – Artem Sokolov Feb 13 '20 at 16:37
  • @ArtemSokolov but the `-x3` in the formula should exclude `x3` from the data frame, so it shouldn't matter whether it's single-level or not. Why doesn't it exclude it? – robertspierre Jun 14 '22 at 19:51
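
The mechanism described in the comments above can be checked step by step. The following is a minimal sketch, not taken from the original posts: it assumes a base `data.frame` in place of the `data.table` so that it needs no extra packages, and the names `df`, `men`, `tt`, `mf` and `err` are made up for illustration. It suggests that `- x3` does remove `x3` from the model terms, yet `x3` still travels along as a column of the model frame, where `drop.unused.levels = TRUE` leaves it with a single level.

```r
set.seed(1)

# Same shape as the sample data in the question, as a plain data.frame (assumption)
df <- data.frame(
  y  = rnorm(100),
  x1 = rnorm(100),
  x2 = rnorm(100),
  x3 = factor(rep(c("men", "women"), each = 50))
)
men <- df[df$x3 == "men", ]

# The expanded term labels no longer mention x3 ...
tt <- terms(y ~ . - x3, data = men)
attr(tt, "term.labels")

# ... but the model frame that lm() builds still carries x3 as a column,
# and dropping unused levels turns it into a one-level factor
mf <- model.frame(y ~ . - x3, data = men, drop.unused.levels = TRUE)
names(mf)
nlevels(mf$x3)

# Building the design matrix from that frame, as lm() does, is where it fails;
# the error message is captured here instead of stopping the script
err <- tryCatch(
  model.matrix(attr(mf, "terms"), mf),
  error = function(e) conditionMessage(e)
)
err
```

If this reading is right, the failure comes from setting up contrasts for every factor column in the model frame, whether or not that column appears in the term labels.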

1 Answer

The error occurs because x3 is still carried into the model, where it is a factor with only one level, "men" (see the comment below from @Artem Sokolov).

One way to solve it is to subset ahead of time:

dt = data.table(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100), x3 = as.factor(c(rep('men',50), rep('women',50)))) # sample data

dmen <- dt[x3 == 'men'] # create a new subsetted dataset with just men

lm( y ~ ., dmen[,-"x3"]) # now drop the x3 column from the dataset (just for the model)

Or you can do both in the same step:

lm( y ~ ., dt[x3 == 'men',-"x3"])
Dylan_Gomes
  • Overall, this is a nice solution. One thing to correct is that `-x3` in a formula does *not* cause `lm` to think that you're trying to subtract the column. The "don't use x3 in the model" intent is communicated correctly, but the issue is that `lm` calls `model.frame( ..., drop.unused.levels=TRUE )` causing `x3` to become a single-level factor, leading to downstream problems in `model.matrix()`. – Artem Sokolov Feb 13 '20 at 16:42
  • Thanks for the clarification, Artem Sokolov. I have taken that incorrect explanation out of my answer. – Dylan_Gomes Feb 13 '20 at 17:26
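
Following on from the clarification in the comments, another option is to build the formula programmatically so that `x3` never enters the model frame at all. This is a sketch rather than part of the original answer: it uses base R's `reformulate()` and `setdiff()`, assumes a plain `data.frame` in place of the `data.table`, and the names `df`, `men`, `predictors`, `f` and `fit` are made up for illustration.

```r
set.seed(1)

# Same shape as the sample data in the question, as a plain data.frame (assumption)
df <- data.frame(
  y  = rnorm(100),
  x1 = rnorm(100),
  x2 = rnorm(100),
  x3 = factor(rep(c("men", "women"), each = 50))
)
men <- df[df$x3 == "men", ]

# Every column except the response and the now-constant factor
predictors <- setdiff(names(men), c("y", "x3"))

# Produces the formula y ~ x1 + x2 without typing the predictors by hand
f <- reformulate(predictors, response = "y")

fit <- lm(f, data = men)
names(coef(fit))
```

Because the resulting formula names only `x1` and `x2`, the model frame contains no single-level factor. The trade-off for teaching is that students see an explicit formula being constructed instead of the `. - x3` shorthand.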