Understanding the Output Coefficients from a Linear Model Regression in R

Question

I'm currently reading a fairly simple textbook on hypothesis testing. It explains that the coefficients of a linear model with two categorical independent variables (with 2 and 3 levels, respectively) and a continuous dependent variable should be interpreted as follows: each coefficient is the difference between the overall mean of the dependent variable (the mean across all levels of both categorical variables) and the mean of the dependent variable within a given level of a categorical variable. I hope that's understandable.

However, when I try to reproduce the example in the book, I do not get the same coefficients, standard errors, t-values or p-values.

I created a reproducible example using the ToothGrowth dataset, which shows the same discrepancy:

library(tidyverse)

# Transforming Data to a Tibble and Change Variable 'dose' to a Factor:
tooth_growth_reprex <- ToothGrowth %>%
  as_tibble() %>%
  mutate(dose = as.factor(dose))

# Creating Linear Model of Variables in ToothGrowth (tg):
tg_lm <- lm(formula = len ~ supp * dose, data = tooth_growth_reprex)

# Extracting suppVC coefficient:
(coef_supp_vc <- tg_lm$coefficients["suppVC"])
#> suppVC 
#>  -5.25

# Calculating Mean Difference between Overall Mean and Supplement VC Mean:
## Overall Mean:
(overall_summary <- tooth_growth_reprex %>%
  summarise(Mean = mean(len)))
#> # A tibble: 1 x 1
#>    Mean
#>   <dbl>
#> 1  18.8

## Supp VC Mean:
(supp_vc_summary <- tooth_growth_reprex %>%
  group_by(supp) %>%
  summarise(Mean = mean(len))) %>% 
  filter(supp == "VC")
#> # A tibble: 1 x 2
#>   supp   Mean
#>   <fct> <dbl>
#> 1 VC     17.0

## Difference between Overall Mean and Supp VC Mean:
(mean_dif_overall_vc <- overall_summary$Mean - supp_vc_summary$Mean[2])
#> [1] 1.85

# Testing if supp_VC coefficient and difference between Overall Mean and Supp VC Mean is near identical:
near(coef_supp_vc, mean_dif_overall_vc)
#> suppVC 
#>  FALSE

Created on 2021-02-23 by the reprex package (v1.0.0)

My questions:

Am I understanding the interpretation of the coefficient values completely wrong?
What is lm() actually calculating for the coefficients?
Are there any functions in R that can calculate what I'm interested in, without me having to do it manually?

I hope this is enough information. If not, please don't hesitate to ask me!

score 0 · Answer 1 · answered Feb 23 '21 at 14:37

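The lm() function uses dummy (treatment) coding by default, so each coefficient in your model is a comparison against the mean of the reference group. The reference group is defined by the first level of each factor, here supp = OJ and dose = 0.5. The intercept is the mean of that group, and because the model includes an interaction, suppVC is the difference between VC and OJ at dose = 0.5, not a deviation from the overall mean.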

You can then do this verification like so:

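# mean_table is not defined in the answer; assumed here to hold the mean of len per supp/dose cell:
mean_table <- tooth_growth_reprex %>% group_by(supp, dose) %>% summarise(M = mean(len), .groups = "drop")

near(coef(tg_lm)["(Intercept)"] + coef(tg_lm)["suppVC"], mean_table %>% filter(supp == "VC" & dose == 0.5) %>% pull(M))

near(coef(tg_lm)["(Intercept)"] + coef(tg_lm)["suppVC"] + coef(tg_lm)["dose1"] + coef(tg_lm)["suppVC:dose1"], mean_table %>% filter(supp == "VC" & dose == 1) %>% pull(M))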

You can read into the differences here
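
As a minimal sketch of what the default dummy (treatment) coding means for this model, the following base-R snippet compares the intercept and the suppVC coefficient to the cell means. It uses only the built-in ToothGrowth data; the names tg, fit and cell_means are illustrative and do not come from the original post.

```r
# Illustrative sketch: tg, fit and cell_means are assumed names, not from the post.
tg <- ToothGrowth
tg$dose <- factor(tg$dose)

# The default contrasts for unordered factors are contr.treatment (dummy coding).
fit <- lm(len ~ supp * dose, data = tg)

# Mean of len for every supp/dose cell (rows: supp, columns: dose).
cell_means <- tapply(tg$len, list(tg$supp, tg$dose), mean)

# The intercept is the mean of the reference cell (supp = OJ, dose = 0.5).
all.equal(unname(coef(fit)["(Intercept)"]), cell_means["OJ", "0.5"])

# suppVC is the VC-minus-OJ difference within the reference dose (0.5),
# not a deviation from the overall mean.
all.equal(unname(coef(fit)["suppVC"]),
          cell_means["VC", "0.5"] - cell_means["OJ", "0.5"])
```

Both comparisons use all.equal() rather than == because the fitted coefficients and the cell means can differ by floating-point rounding.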

erocoar


Thank you so much for pointing me in the right direction, @erocoar! I've read what you linked and now understand that dummy coding is indeed not what I want for my regression model in this case. However, I can't figure out how to make lm() code categorical variables in any way other than dummy coding. My best guess is that it has something to do with the contrasts argument of lm(), but I'm pretty much out of my depth. Do you have any idea? I'll keep digging and post if I find a way around the problem. – PeRiKo Feb 23 '21 at 15:46
I found the solution in: https://stats.stackexchange.com/questions/52132/how-to-do-regression-with-effect-coding-instead-of-dummy-coding-in-r The contrasts argument of lm() can be used to set each factor's coding to any of contr.SAS, contr.sum, contr.treatment, contr.poly or contr.helmert. Each contrast provides a different coding of the categorical variables. In my case, where I want to compare the mean of the dependent variable for each level of each categorical variable to the grand mean, contr.sum did the trick. Again, many thanks for pointing me in the right direction! – PeRiKo Feb 23 '21 at 16:54
Glad to hear you found the solution :) I was not aware of the contrasts argument myself. – erocoar Feb 25 '21 at 10:21
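
The contr.sum approach described in the comments can be sketched as follows. This is a minimal base-R illustration using the built-in ToothGrowth data; tg, fit_sum, grand_mean and supp_means are illustrative names, not from the original post. With sum-to-zero contrasts and a balanced design such as this one, supp1 is the deviation of the first supp level (OJ) from the grand mean. VC gets no coefficient of its own; its deviation is the negative of supp1.

```r
# Illustrative sketch: tg, fit_sum, grand_mean and supp_means are assumed names.
tg <- ToothGrowth
tg$dose <- factor(tg$dose)

# The contrasts argument of lm() takes a named list, one entry per factor.
fit_sum <- lm(len ~ supp * dose, data = tg,
              contrasts = list(supp = "contr.sum", dose = "contr.sum"))

grand_mean <- mean(tg$len)
supp_means <- tapply(tg$len, tg$supp, mean)

# The intercept equals the grand mean (balanced design: 10 observations per cell).
all.equal(unname(coef(fit_sum)["(Intercept)"]), grand_mean)

# supp1 is the OJ mean minus the grand mean.
all.equal(unname(coef(fit_sum)["supp1"]),
          unname(supp_means["OJ"]) - grand_mean)

# The VC deviation from the grand mean is the negative of supp1.
all.equal(-unname(coef(fit_sum)["supp1"]),
          unname(supp_means["VC"]) - grand_mean)
```

The equality between the coefficients and the simple marginal means relies on the design being balanced; with unequal cell sizes the intercept would be the unweighted mean of the cell means instead of mean(len).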
