I noticed that when using dummy coding to fit my linear models, R excludes certain parameters when forming the model matrix. What algorithm does R use to decide which ones to exclude?
-
Maybe start by reading about the `contrasts` argument to `?lm`, which will lead to `?model.matrix` and also the documentation at `?contr.treatment`. Maybe a book on linear model theory might be in order, too, since the documentation will presume that you have a basic understanding of the math. – joran Aug 23 '16 at 14:59
-
@joran I believe I do understand contrasts and coding. Dummy coding is merely a way of grouping coefficients in a regression equation, but it is not clear how R makes the choice of grouping, since, after all, the choice of grouping is not unique. For simple cases I do understand the default contrasts, but for complex cases my understanding seems to fall apart. – Justin Thong Aug 23 '16 at 15:10
-
Then I suspect the documentation I referred to ought to be sufficient. The defaults are shown in `options("contrasts")`. – joran Aug 23 '16 at 15:13
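For concreteness, the defaults joran mentions can be inspected in a stock R session (a quick sketch of what the output looks like):

```r
## Inspect the default contrast settings
options("contrasts")
## $contrasts
##         unordered           ordered
## "contr.treatment"      "contr.poly"

## contr.treatment drops the first level: its coding matrix for a
## three-level factor has no column for level 1
contr.treatment(3)
##   2 3
## 1 0 0
## 2 1 0
## 3 0 1
```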
1 Answer
It is not well documented, but it goes back to whatever pivoting algorithm the underlying (modified LINPACK) QR code uses. From the source code of `lm.fit`:
```r
z <- .Call(C_Cdqrls, x, y, tol, FALSE)
...
coef <- z$coefficients
pivot <- z$pivot
...
r2 <- if (z$rank < p) (z$rank + 1L):p else integer()
if (is.matrix(y)) {
    ....
} else {
    coef[r2] <- NA
    ## avoid copy
    if (z$pivoted) coef[pivot] <- coef
    ...
}
```
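The same pivoting can be observed directly via `qr()`, which with the default `LAPACK = FALSE` calls the same modified LINPACK routine as `lm.fit`. A small sketch with a rank-deficient matrix:

```r
## b is exactly 2*a, so X has rank 1
X <- cbind(a = c(1, 2, 3, 4), b = c(2, 4, 6, 8))
qx <- qr(X)  # default LAPACK = FALSE uses the dqrdc2 routine
qx$rank      # 1: one column is detected as (near-)zero-norm
qx$pivot     # the pivot vector recording any column moves
```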
If you want to dig back further, you need to look into `dqrdc2.f`, which says (for what it's worth):

```fortran
c     dqrdc2 uses householder transformations to compute the qr
c     factorization of an n by p matrix x.  a limited column
c     pivoting strategy based on the 2-norms of the reduced columns
c     moves columns with near-zero norm to the right-hand edge of
c     the x matrix.  this strategy means that sequential one
c     degree-of-freedom effects can be computed in a natural way.
```
In practice I have generally found that R eliminates the last (rightmost) column of a set of collinear predictor variables ...
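As a sketch of that behaviour (variable names here are made up for illustration): construct a predictor that is an exact linear combination of two others, and `lm` will pivot it out, reporting its coefficient as `NA`:

```r
set.seed(1)
x1 <- rnorm(10)
x2 <- rnorm(10)
x3 <- x1 + x2              # exactly collinear with x1 and x2
y  <- x1 - x2 + rnorm(10)
fit <- lm(y ~ x1 + x2 + x3)
coef(fit)                  # the coefficient for x3 (the rightmost
                           # collinear column) comes back as NA
```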

-
Thanks. I think you are talking about linearly dependent columns encountered during the QR factorization. I am talking about the stage of forming the model matrix. Both stages address the problem of linear dependence, but I believe they are slightly different. You are describing how linearly dependent columns that are already in the model matrix get moved out of the rank of the matrix by pivoting. I am asking how R decides which parameters are excluded when forming the model matrix, NOT after. – Justin Thong Aug 24 '16 at 10:53
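To illustrate the distinction this comment draws (a minimal sketch): the exclusion at model-matrix formation time is done by the contrasts, before any QR factorization happens. With the default `contr.treatment`, `model.matrix` emits one fewer dummy column than the factor has levels, absorbing the first level into the intercept:

```r
f <- factor(c("a", "b", "b", "c"))
model.matrix(~ f)
## columns: (Intercept), fb, fc -- no column for level "a",
## which serves as the baseline under contr.treatment
```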