
I am estimating a regression model with some factor/categorical variables and some numerical ones. Is it possible to display the reference category for each factor/categorical variable in the summary of the regression model?

Ideally this would also carry over to texreg or stargazer for LaTeX output, but having it in the regression summary would already be a good start.

Does anybody have an idea? What am I missing?

SKupek
  • The reference category values are actually included in the intercept value; this is explained in more detail here: https://stats.stackexchange.com/questions/94010/understanding-dummy-manual-or-automated-variable-creation-in-glm – Tur Dec 23 '21 at 09:19

2 Answers


The reference level is the one that is missing from the summary, because the coefficients of the other levels are contrasts to the reference level. The intercept therefore represents the expected value in the reference category (with any numeric predictors at zero).

iris <- transform(iris, Species_=factor(Species))  ## create factor

summary(lm(Sepal.Length ~ Petal.Length + Species_, iris))$coefficients
#                    Estimate Std. Error   t value      Pr(>|t|)
# (Intercept)         3.6835266 0.10609608 34.718780 1.968671e-72
# Petal.Length        0.9045646 0.06478559 13.962436 1.121002e-28
# Species_versicolor -1.6009717 0.19346616 -8.275203 7.371529e-14
# Species_virginica  -2.1176692 0.27346121 -7.743947 1.480296e-12

You could remove the intercept to get the missing level displayed, but that does not make much sense. You then just get a separate intercept for each level (the level means, if there are no other predictors) without a reference, whereas what you are interested in is the contrast between the reference level and the other levels.

summary(lm(Sepal.Length ~ 0 + Petal.Length + Species_, iris))$coefficients
#                     Estimate Std. Error   t value     Pr(>|t|)
# Petal.Length       0.9045646 0.06478559 13.962436 1.121002e-28
# Species_setosa     3.6835266 0.10609608 34.718780 1.968671e-72
# Species_versicolor 2.0825548 0.28009598  7.435147 8.171219e-12
# Species_virginica  1.5658574 0.36285224  4.315413 2.921850e-05
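To make the relation between the two tables concrete, here is a minimal sketch that rebuilds the level-specific intercepts of the no-intercept model from the intercept and contrasts of the reference-level model. It uses `Species` directly (it is already a factor in `iris`); the object names `dat`, `fit_ref`, `fit_zero` and `rebuilt` are only illustrative.

```r
# Sketch: both parameterizations describe the same fitted model.
dat <- datasets::iris  # Species is already a factor here

fit_ref  <- lm(Sepal.Length ~ Petal.Length + Species, data = dat)      # with intercept
fit_zero <- lm(Sepal.Length ~ 0 + Petal.Length + Species, data = dat)  # without intercept

b_ref  <- coef(fit_ref)
b_zero <- coef(fit_zero)

# level-specific intercepts rebuilt as "intercept + contrast"
rebuilt <- c(
  Speciessetosa     = unname(b_ref["(Intercept)"]),
  Speciesversicolor = unname(b_ref["(Intercept)"] + b_ref["Speciesversicolor"]),
  Speciesvirginica  = unname(b_ref["(Intercept)"] + b_ref["Speciesvirginica"])
)

# compare with the coefficients of the no-intercept fit
all.equal(rebuilt, b_zero[names(rebuilt)])
```

The comparison should agree up to numerical tolerance, which is why dropping the intercept adds no information: it only re-expresses the same fit without showing the contrasts.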

If you're not sure, the reference level is always the first level of the factor (with R's default treatment contrasts).

levels(iris$Species_)[1]
# [1] "setosa"
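With several factors in a model, the same check can be done on the fitted object instead of on each variable. A small sketch, assuming default treatment contrasts: an `lm` fit stores the factor levels it used in its `xlevels` component, so the first entry of each element is the reference level. The names `fit` and `ref_levels` are illustrative.

```r
# Sketch: list the reference level of every factor used in a fitted model.
dat <- datasets::iris
fit <- lm(Sepal.Length ~ Petal.Length + Species, data = dat)

# fit$xlevels is a named list: one character vector of levels per factor
ref_levels <- vapply(fit$xlevels, function(lv) lv[1], character(1))
ref_levels
```

This only reports the first level; if contrasts other than the default are set, the first level is not necessarily the baseline.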

To check that, specify a different reference level and see that it is now first.

iris$Species_ <- relevel(iris$Species_, ref='versicolor')

levels(iris$Species_)[1]
# [1] "versicolor"
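Refitting after `relevel()` shows the effect in the coefficient table as well: the new reference level is now the one without its own row. A minimal sketch on a copy of the data (the names `dat`, `fit` and `coef_names` are illustrative):

```r
# Sketch: after relevel(), the new reference level drops out of the table.
dat <- datasets::iris
dat$Species <- relevel(dat$Species, ref = "versicolor")

fit <- lm(Sepal.Length ~ Petal.Length + Species, data = dat)

# row names of the coefficient table; "Speciesversicolor" is not among them
coef_names <- rownames(summary(fit)$coefficients)
coef_names
```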

It is common to refer to the reference level in a note under the table in the report, and I recommend that you do the same.
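Such a note can be generated from the model rather than typed by hand. The following is a sketch only: it builds the note text in base R and, if texreg is installed, passes it through the `custom.note` argument of `screenreg()` (`texreg()` takes the same argument for LaTeX output). Passing a custom note replaces the default significance legend; the texreg documentation describes a `%stars` placeholder for keeping it. The names `ref_levels` and `note` are illustrative.

```r
# Sketch: build a "reference categories" table note from the fitted model.
dat <- datasets::iris
fit <- lm(Sepal.Length ~ Petal.Length + Species, data = dat)

# first level of each factor = reference level under default contrasts
ref_levels <- vapply(fit$xlevels, function(lv) lv[1], character(1))

note <- paste0(
  "Reference categories: ",
  paste(sprintf("%s = %s", names(ref_levels), ref_levels), collapse = "; ")
)
note

# optional: only runs when the texreg package is available
if (requireNamespace("texreg", quietly = TRUE)) {
  texreg::screenreg(fit, custom.note = note)
}
```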

jay.sf
  • Thank you very much for your answer. Of course, it is an option to just explain it below the regression report. So creating an empty line or something similar that displays the reference category is not easily possible? – SKupek Dec 23 '21 at 09:48
  • @SKupek Everything is possible :) You could rewrite the `summary.lm` method, or add a respective line in `texreg` or similar. However, I very much doubt that this effort is worth it. – jay.sf Dec 23 '21 at 10:06
  • Thank you for your answer. Yeah, I think that's maybe too much effort for now. Thanks – SKupek Dec 23 '21 at 11:42

For LaTeX output or similar, it is easy to add a line with the modelsummary package (for example, to display your reference category).

library(modelsummary)
library(tibble)  # provides tribble()

data(mtcars)

models <- list()
models[['OLS']] <- lm(mpg ~ factor(cyl), mtcars)
models[['Logit']] <- glm(am ~ factor(cyl), mtcars, family = binomial)

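# extra rows: a dash marks the reference category factor(cyl)4
rows <- tribble(~term,          ~OLS,  ~Logit,
                'factor(cyl)4', '-',   '-',
                'Info',         '???', 'XYZ')
# row positions at which the extra rows are inserted in the final table
attr(rows, 'position') <- c(3, 9)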
        
modelsummary(models, add_rows = rows)

See here for details:

https://vincentarelbundock.github.io/modelsummary/articles/modelsummary.html#add_rows

SKupek