I'm reading a fairly simple hypothesis textbook at the moment. It is being explained that the coefficients from a linear model, where the independent variables are two categorical variables with 2 and 3 factors respectively, and the dependent variable is a continuous variable should be interpreted as; the difference between the overall mean of the dependent variable (mean across all categorical variables and factors) and the mean of the dependent variable based on the values of the dependent variable from a given factorized categorical variable. I hope it's understandable.
However, when I try to reproduce the example in the book, I do not get the same coefficients, std. err., T- or P-values.
I created a reproducible example using the ToothGrowth dataset, where the same is the case:
library(tidyverse)
# Transforming Data to a Tibble and Change Variable 'dose' to a Factor:
tooth_growth_reprex <- ToothGrowth %>%
as_tibble() %>%
mutate(dose = as.factor(dose))
# Creating Linear Model of Variables in ToothGrowth (tg):
tg_lm <- lm(formula = len ~ supp * dose, data = tooth_growth_reprex)
# Extracting suppVC coefficient:
(coef_supp_vc <- tg_lm$coefficients["suppVC"])
#> suppVC
#> -5.25
# Calculating Mean Difference between Overall Mean and Supplement VC Mean:
## Overall Mean:
(overall_summary <- tooth_growth_reprex %>%
summarise(Mean = mean(len)))
#> # A tibble: 1 x 1
#> Mean
#> <dbl>
#> 1 18.8
## Supp VC Mean:
(supp_vc_summary <- tooth_growth_reprex %>%
group_by(supp) %>%
summarise(Mean = mean(len))) %>%
filter(supp == "VC")
#> # A tibble: 1 x 2
#> supp Mean
#> <fct> <dbl>
#> 1 VC 17.0
## Difference between Overall Mean and Supp VC Mean:
(mean_dif_overall_vc <- overall_summary$Mean - supp_vc_summary$Mean[2])
#> [1] 1.85
# Testing if supp_VC coefficient and difference between Overall Mean and Supp VC Mean is near identical:
near(coef_supp_vc, mean_dif_overall_vc)
#> suppVC
#> FALSE
Created on 2021-02-23 by the reprex package (v1.0.0)
My questions:
- Am I understanding the interpretation of the coefficient values completely wrong?
- What is the lm actually calculating regarding the coefficients?
- Is there any functions in R that can calculate what I'm interested in, with me having to do it manually?
I hope this is enough information. If not, please don't hesitate to ask me!