R model.matrix drop multicollinear variables

Question

Is there a way to force model.matrix.lm to drop multicollinear variables, as is done in the estimation stage by lm()?

Here is an example:

library(fixest)

N <- 10
x1 <- rnorm(N)
x2 <- x1
y <- 1 + x1 + x2 + rnorm(N)

df <- data.frame(y = y, x1 = x1, x2 = x2)

fit_lm <- lm(y ~ x1 + x2, data = df)
summary(fit_lm)
# Call:
#   lm(formula = y ~ x1 + x2, data = df)
# 
# Residuals:
#   Min       1Q   Median       3Q      Max 
# -1.82680 -0.41503  0.05499  0.67185  0.97830 
# 
# Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   0.7494     0.2885   2.598   0.0317 *  
#   x1            2.3905     0.3157   7.571 6.48e-05 ***
#   x2                NA         NA      NA       NA    
# ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.8924 on 8 degrees of freedom
# Multiple R-squared:  0.8775,  Adjusted R-squared:  0.8622 
# F-statistic: 57.33 on 1 and 8 DF,  p-value: 6.476e-05

Note that lm() drops the collinear variable x2 from the model. But model.matrix() keeps it:

 model.matrix(fit_lm)
#   (Intercept)          x1          x2
#1            1  1.41175158  1.41175158
#2            1  0.06164133  0.06164133
#3            1  0.09285047  0.09285047
#4            1 -0.63202909 -0.63202909
#5            1  0.25189850  0.25189850
#6            1 -0.18553830 -0.18553830
#7            1  0.65630180  0.65630180
#8            1 -1.77536852 -1.77536852
#9            1 -0.30571009 -0.30571009
#10           1 -1.47296229 -1.47296229
#attr(,"assign")
#[1] 0 1 2

The model.matrix method from fixest instead allows to drop x2:

fit_feols <- feols(y ~ x1 + x2, data = df)
model.matrix(fit_feols, type = "rhs", collin.rm = TRUE)
# (Intercept)          x1
# [1,]           1  1.41175158
# [2,]           1  0.06164133
# [3,]           1  0.09285047
# [4,]           1 -0.63202909
# [5,]           1  0.25189850
# [6,]           1 -0.18553830
# [7,]           1  0.65630180
# [8,]           1 -1.77536852
# [9,]           1 -0.30571009
# [10,]           1 -1.47296229

Is there a way to drop x2 when calling model.matrix.lm()?

score 2 · Accepted Answer · answered May 02 '22 at 18:52

2

So long as the overhead from running the linear model is not too high, you could write a little function like the one here to do it:

N <- 10
x1 <- rnorm(N)
x2 <- x1
y <- 1 + x1 + x2 + rnorm(N)

df <- data.frame(y = y, x1 = x1, x2 = x2)

fit_lm <- lm(y ~ x1 + x2, data = df)

model.matrix2 <- function(model){
  bn <- names(na.omit(coef(model)))
  X <- model.matrix(model)
  X[,colnames(X) %in% bn]
}

model.matrix2(fit_lm)
#>    (Intercept)          x1
#> 1            1 -0.04654473
#> 2            1  2.14473751
#> 3            1  0.02688125
#> 4            1  0.95071038
#> 5            1 -1.41621259
#> 6            1  1.47840480
#> 7            1  0.56580182
#> 8            1  0.14480401
#> 9            1 -0.02404072
#> 10           1 -0.14393258

^{Created on 2022-05-02 by the reprex package (v2.0.1)}

In the code above, model.matrix2() is the function that post-processes the model matrix to contain only the variables that have non-missing coefficients in the linear model.

answered May 02 '22 at 18:52

DaveArmstrong

18,377
2
13
25

That's an elegant solution, thanks! – A.Fischer May 02 '22 at 19:12
There is a situation when this does not work, I believe: sometimes `lm()` silently drops factor levels when there is multi-collinearity in two factor variables - in this case, there are no NA values in the `coefs`. I currently trying to provide a reproducible data set, so far I have only encountered this on data I cannot share here. – A.Fischer May 02 '22 at 19:29
It should work even for factor levels that get dropped silently (i.e., without `NA` coefficient values). – DaveArmstrong May 02 '22 at 20:05
You're right, I was not thinking straight yesterday evening! Thanks for your help! :) – A.Fischer May 03 '22 at 19:47
Your suggested solution is actually exactly what's implemented in the `sandwich` package for computing the `meat` of sandwich estimators! E.g. see [here](https://github.com/cran/sandwich/blob/master/R/vcovHC.R#L22) – A.Fischer May 03 '22 at 20:39

R model.matrix drop multicollinear variables

1 Answers1