I noticed that when using dummy coding to fit my linear models, R excludes certain parameters when forming the model matrix. What algorithm does R use to decide which ones to exclude?
-
Maybe start by reading about the `contrasts` argument to `?lm`, which will lead to `?model.matrix` and also the documentation at `?contr.treatment`. Maybe a book on linear model theory might be in order, too, since the documentation will presume that you have a basic understanding of the math. – joran Aug 23 '16 at 14:59
-
@joran I believe I do understand contrasts and coding. Dummy coding is merely a way of grouping coefficients in a regression equation, but it is not clear how R makes the choice of grouping, since, after all, the choice of grouping is not unique. For simple cases I do understand the default contrasts, but for complex cases my understanding seems to fall apart. – Justin Thong Aug 23 '16 at 15:10
-
Then I suspect the documentation I referred to ought to be sufficient. The defaults are shown in `options("contrasts")`. – joran Aug 23 '16 at 15:13
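For concreteness, the defaults joran mentions can be inspected in a stock R session (a quick sketch of what the output looks like):

```r
## Inspect the default contrast settings
options("contrasts")
## $contrasts
##         unordered           ordered
## "contr.treatment"      "contr.poly"

## contr.treatment drops the first level: its coding matrix for a
## three-level factor has no column for level 1
contr.treatment(3)
##   2 3
## 1 0 0
## 2 1 0
## 3 0 1
```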
1 Answer
It is not well documented, but it goes back to whatever pivoting algorithm the underlying (modified LINPACK) QR code uses. From the source code of `lm.fit`:
```r
z <- .Call(C_Cdqrls, x, y, tol, FALSE)
...
coef <- z$coefficients
pivot <- z$pivot
...
r2 <- if (z$rank < p) (z$rank + 1L):p else integer()
if (is.matrix(y)) {
    ....
} else {
    coef[r2] <- NA
    ## avoid copy
    if (z$pivoted) coef[pivot] <- coef
    ...
}
```
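The same pivoting can be observed directly via `qr()`, which with the default `LAPACK = FALSE` calls the same modified LINPACK routine as `lm.fit`. A small sketch with a rank-deficient matrix:

```r
## b is exactly 2*a, so X has rank 1
X <- cbind(a = c(1, 2, 3, 4), b = c(2, 4, 6, 8))
qx <- qr(X)  # default LAPACK = FALSE uses the dqrdc2 routine
qx$rank      # 1: one column is detected as (near-)zero-norm
qx$pivot     # the pivot vector recording any column moves
```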
If you want to dig back further, you need to look into `dqrdc2.f`, which says (for what it's worth):

```fortran
c     dqrdc2 uses householder transformations to compute the qr
c     factorization of an n by p matrix x.  a limited column
c     pivoting strategy based on the 2-norms of the reduced columns
c     moves columns with near-zero norm to the right-hand edge of
c     the x matrix.  this strategy means that sequential one
c     degree-of-freedom effects can be computed in a natural way.
```
In practice I have generally found that R eliminates the last (rightmost) column of a set of collinear predictor variables ...
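As a sketch of that behaviour (variable names here are made up for illustration): construct a predictor that is an exact linear combination of two others, and `lm` will pivot it out, reporting its coefficient as `NA`:

```r
set.seed(1)
x1 <- rnorm(10)
x2 <- rnorm(10)
x3 <- x1 + x2              # exactly collinear with x1 and x2
y  <- x1 - x2 + rnorm(10)
fit <- lm(y ~ x1 + x2 + x3)
coef(fit)                  # the coefficient for x3 (the rightmost
                           # collinear column) comes back as NA
```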

-
Thanks. I think you are talking about linearly dependent columns encountered during the QR factorization. I am talking about the stage of forming the model matrix. Both stages address the problem of linear dependence, but I believe they are slightly different. You are describing how linearly dependent columns that are already in the model matrix get moved out of the rank of the matrix by pivoting. I am asking how R decides which parameters are excluded when forming the model matrix, NOT after. – Justin Thong Aug 24 '16 at 10:53
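To illustrate the distinction this comment draws (a minimal sketch): the exclusion at model-matrix formation time is done by the contrasts, before any QR factorization happens. With the default `contr.treatment`, `model.matrix` emits one fewer dummy column than the factor has levels, absorbing the first level into the intercept:

```r
f <- factor(c("a", "b", "b", "c"))
model.matrix(~ f)
## columns: (Intercept), fb, fc -- no column for level "a",
## which serves as the baseline under contr.treatment
```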