2

I'm having trouble understanding effect coding with glm. As an example:

data('mpg')
mpg$trans = as.factor(mpg$trans)
levels(mpg$trans)
[1] "auto(av)"   "auto(l3)"   "auto(l4)"   "auto(l5)"   "auto(l6)"   "auto(s4)"   "auto(s5)"   "auto(s6)"   "manual(m5)" "manual(m6)" 

For effect (or dummy) coding, glm takes the first level as reference level, so in this case it will be "auto(av)".

mpg_glm = glm(hwy ~ trans, data = mpg, contrasts = list(trans = contr.sum))
summary(mpg_glm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  24.4173     0.7318  33.366  < 2e-16 ***
trans1        3.3827     2.3592   1.434 0.153017    
trans2        2.5827     3.6210   0.713 0.476426    
trans3       -2.4534     0.9157  -2.679 0.007928 ** 
trans4       -3.6993     1.0865  -3.405 0.000784 ***
trans5       -4.4173     2.1743  -2.032 0.043375 *  
trans6        1.2494     2.9866   0.418 0.676105    
trans7        0.9160     2.9866   0.307 0.759341    
trans8        0.7702     1.4517   0.531 0.596262    
trans9        1.8758     0.9845   1.905 0.058011 . 

So I'm now thinking that trans1 actually is the second level ("auto(l3)"), because the first one is the reference level. To test this I relevel the factor, so that I will see the coefficient for the first level ("auto(av)"):

mpg$trans <- relevel(mpg$trans, ref="auto(l3)")
levels(mpg$trans)
"auto(l3)"   "auto(av)"   "auto(l4)"   "auto(l5)"   "auto(l6)"   "auto(s4)"   "auto(s5)"   "auto(s6)"   "manual(m5)" "manual(m6)"

Now I'm expecting to see the coefficient of the first level and the coefficient of the second level is gone (because that is now the reference level):

mpg_glm = glm(hwy ~ trans, data = mpg, contrasts = list(trans = contr.sum))
summary(mpg_glm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  24.4173     0.7318  33.366  < 2e-16 ***
trans1        2.5827     3.6210   0.713 0.476426    
trans2        3.3827     2.3592   1.434 0.153017    
trans3       -2.4534     0.9157  -2.679 0.007928 ** 
trans4       -3.6993     1.0865  -3.405 0.000784 ***
trans5       -4.4173     2.1743  -2.032 0.043375 *  
trans6        1.2494     2.9866   0.418 0.676105    
trans7        0.9160     2.9866   0.307 0.759341    
trans8        0.7702     1.4517   0.531 0.596262    
trans9        1.8758     0.9845   1.905 0.058011 . 

I see that the only thing that is changed, is that the first 2 coefficients are switched, so which level is taken as reference?? what am I missing here?

W. Mooi
  • 119
  • 1
  • 3
  • 11

1 Answers1

2

You are using contr.sum where all levels are compared to the last level, and with the added constraint that all the cofficients (except the intercept) sum up to zero.

You can check it inside mpg_glm:

mpg_glm = glm(hwy ~ trans, data = mpg, contrasts = list(trans = contr.sum))
mpg_glm$contrasts

           [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
auto(av)      1    0    0    0    0    0    0    0
auto(l3)      0    1    0    0    0    0    0    0
auto(l4)      0    0    1    0    0    0    0    0
auto(l5)      0    0    0    1    0    0    0    0
auto(l6)      0    0    0    0    1    0    0    0
auto(s4)      0    0    0    0    0    1    0    0
auto(s5)      0    0    0    0    0    0    1    0
auto(s6)      0    0    0    0    0    0    0    1
manual(m5)    0    0    0    0    0    0    0    0
manual(m6)   -1   -1   -1   -1   -1   -1   -1   -1
           [,9]
auto(av)      0
auto(l3)      0
auto(l4)      0
auto(l5)      0
auto(l6)      0
auto(s4)      0
auto(s5)      0
auto(s6)      0
manual(m5)    1
manual(m6)   -1

You have 9 columns here which are the 9 non-intercept coefficients in your model. They are more or less the means of each level minus the last level, because of the constraint to sum to zero. The last level is redundant and not shown here:

var_means = with(mpg,tapply(hwy,trans,mean))
cbind(delta_mean = var_means[-length(var_means)]-var_means[length(var_means)],
coef = coef(mpg_glm)[-1])

           delta_mean       coef
auto(av)    3.5894737  3.3827066
auto(l3)    2.7894737  2.5827066
auto(l4)   -2.2466709 -2.4534380
auto(l5)   -3.4925776 -3.6993447
auto(l6)   -4.2105263 -4.4172934
auto(s4)    1.4561404  1.2493733
auto(s5)    1.1228070  0.9160399
auto(s6)    0.9769737  0.7702066
manual(m5)  2.0825771  1.8758101

Hence when you change the first level, you only change the order in which they appear, but the values don't change. You can easily check the contrast:

mpg$trans <- relevel(mpg$trans, ref="auto(l3)")
new_glm = glm(hwy ~ trans, data = mpg, contrasts = list(trans = contr.sum))    
new_glm$contrasts
$trans
           [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
auto(l3)      1    0    0    0    0    0    0    0
auto(av)      0    1    0    0    0    0    0    0
auto(l4)      0    0    1    0    0    0    0    0
auto(l5)      0    0    0    1    0    0    0    0
auto(l6)      0    0    0    0    1    0    0    0
auto(s4)      0    0    0    0    0    1    0    0
auto(s5)      0    0    0    0    0    0    1    0
auto(s6)      0    0    0    0    0    0    0    1
manual(m5)    0    0    0    0    0    0    0    0
manual(m6)   -1   -1   -1   -1   -1   -1   -1   -1
           [,9]
auto(l3)      0
auto(av)      0
auto(l4)      0
auto(l5)      0
auto(l6)      0
auto(s4)      0
auto(s5)      0
auto(s6)      0
manual(m5)    1
manual(m6)   -1

For what you describe, as changing the reference and having other levels relative this the ref, it should be contr.treatment.

StupidWolf
  • 45,075
  • 17
  • 40
  • 72
  • Thanks for your answer! I understand now that the last level is the reference level. How can I change the reference level so that I can also see the difference of the last level and the grand mean of this factor? – W. Mooi Feb 24 '20 at 16:17
  • Not quite sure what you mean.. Are you referring the last coefficient that is not shown? – StupidWolf Feb 24 '20 at 16:26
  • Yes, but I now see that I have to change the order of the levels like mpg$trans = factor(mpg$trans, levels = c(""....etc)). Then I can see the other coefficient :) – W. Mooi Feb 25 '20 at 08:44
  • Yup that's one way. So you manage to get what you need with the proper levels? The last coefficient, if you remember, all the non-intercept coefficients add up to zero. So you do the sum of all coefficients, and multiply it by -1 – StupidWolf Feb 25 '20 at 08:47