Understanding why linear regression isn't treating my categorical variable as expected?

Question

I'm reading about formulas and linear regression, and I'm having trouble understanding how to interpret the output of lm for a linear regression with multiple parameters and categorical variables.

I think I understand how to interpret the output for a simple y = a + bx formula (correct me if what I'm saying below is wrong).

#library(tidyverse)
#library(modelr)
require(ggplot2)
require(dplyr)

diamonds2 <- diamonds %>%
  mutate(lprice = log2(price), lcarat = log2(carat))

mod <- lm(
  lprice ~ lcarat,
  data = diamonds2
)

#diamonds2 %>% modelr::add_predictions(mod, "pred_price")
diamonds2$pred_price <- predict(mod, diamonds2) # if you don't have modelr

The model (mod) is

Call:
lm(formula = lprice ~ lcarat, data = diamonds2)

Coefficients:
(Intercept)       lcarat  
     12.189        1.676

As I understand it, that means that when I add predictions, my formula to generate the predictions is

pred_price = 12.189 + (1.676 * lcarat)

I get confused when I add a categorical variable to my formula

diamonds2 <- diamonds %>%
  mutate(lprice = log2(price), lcarat = log2(carat))

mod <- lm(
  lprice ~ lcarat + cut,   # I added a categorical variable here
  data = diamonds2
)

diamonds2 %>%
  add_predictions(mod, "pred_price")

Now the model is

Call:
lm(formula = lprice ~ lcarat + cut, data = diamonds2)

Coefficients:
(Intercept)       lcarat        cut.L        cut.Q        cut.C        cut^4  
   12.10711      1.69577      0.32364     -0.09583      0.07631      0.02688

I'm confused about a few things.

1) diamonds$cut has five possible values (fair, good, very good, premium, ideal), so why does the model only show four values for cut?

2) From my understanding, R treats a categorical variable as being either 1 or 0 in the linear regression equation, so each "cut" coefficient will either be multiplied by 1 or 0 when evaluating a data row. Is that correct?

3) How do I write a y = a_0 + (a_1 * x_1) + (a_2 * x_2)... from that coefficients given above? Is that possible in this case?

Statistics questions are more on-topic over at [CrossValidated](https://stats.stackexchange.com) — smci, May 07 '17 at 22:40
Also, make your code example fully reproducible, as in open a new R session and make sure it actually executes. You're missing a `require(ggplot2)` which is the library where diamonds dataset is, and `require(dplyr)` for the pipe and mutate operations. — smci, May 07 '17 at 22:51
Your code is still breaking because of unknown function `add_predictions()`, which is apparently from modelr, which I don't have installed. You have to make your code reproducible. — smci, May 07 '17 at 22:55
You don't really need modelr, that line simply should be `diamonds2$pred_price <- predict(mod, diamonds2)` — smci, May 07 '17 at 22:59
Ok. I'm learning this from a book, and that's where I got the add_predictions from. — Ben Rubin, May 07 '17 at 23:00
No problem. Minimal Reproducible Example doesn't just mean shortest code, it also means no unnecessary libraries. We don't need tidyverse at all, and we can avoid modelr with the equivalent line above. — smci, May 07 '17 at 23:05
So yeah, lm is treating cut as integer although it shouldn't. Looking into it... — smci, May 07 '17 at 23:07
Ah, here's the root-cause: if you type `options(contrasts)` you'll see the default behavior for contrasts on categoricals is `unordered:"contr.treatment" ordered:"contr.poly"` — smci, May 07 '17 at 23:15
Fixed this. By clobbering `diamonds$cut <- factor(diamonds$cut, ordered=F)`. Now you get your factor contrast levels! — smci, May 07 '17 at 23:20
I totally agree. R is frequently as maddening as it's powerful, and the scanty error-messages don't help (or actively misdirect). I'd save the old ordered-categorical value first with `diamonds$cut.ordered <- diamonds$cut`. I expect the technically correct answer would be to modify `options(contrasts)` for `ordered` to also be `contr.treatment` — smci, May 07 '17 at 23:27
Some further reading you might like: [Setting and Keeping Contrasts](http://faculty.nps.edu/sebuttre/home/R/contrasts.html). The bottom line is you can set `options('contrasts')` to whatever string list of two contrast function names you want. (You could even write a custom contrast function, say you wanted to control the choice of baseline level in `contr.treatment`) — smci, May 07 '17 at 23:57

smci · Accepted Answer · 2017-05-14T07:31:23.113

5

1) lm did something unwanted, it treated diamonds$cut variable as an ordered categorical instead of a categorical (i.e. didn't use the usual 1/0 dummy-variable contrasts treatment).

Initially I thought you simply needed to make sure lm got a categorical, either by writing lm(formula = lprice ~ lcarat + factor(cut)) or fixing the dataframe diamonds2$cut <- factor(diamonds2$cut).

You were expecting to see contrast levels for your categorical cut. (A 5-level categorical would give 4 contrasts (and an intercept); see the documentation for help(contrasts). However you didn't get contrast-levels from lm, you got polynomial coefficients.

Digging into why this happened, we note that str(cut) tells us cut is an ordered categorical (this is the culprit):

> str(diamonds$cut)
 Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...

Digging further into why lm did, this I looked at the pages for help(contrast), help(lm) and help(model.matrix.default), which led me to options('contrasts'):

> getOption('contrasts')
        unordered           ordered 
"contr.treatment"      "contr.poly"

This means the default behavior of lm for generating contrasts on an ordered categorical (like diamonds$cut), is the unwanted contr.poly(). So either change the default, or convert cut to unordered categorical, or create a new var diamonds$cut <- factor(diamonds$cut, ordered=F) The code for both solutions is at the bottom.

2) No, that's what would happen if you passed diamonds$cut as a categorical. However you got cut.L, .Q, .C, ^4... which are linear, quadratic, cubic, quartic coefficients of (an unwanted) polynomial in the values of cut which is trying to fit to the observed values of lcarat (see help on contrasts which is calling contr.poly and using n=4 levels). That's not what you want.

Another telltale is those levels should have been named cut.Fair, cut.Good, cut.Very_Good, cut.Premium, cut.Ideal (well you'll get 4 out of those 5; the other will be dropped).

3) Once you fix it to treat cut as (unordered) factor, you should get: lprice = coeff.price * lcarat + coeff.Fair * cut.Fair + coeff.Good * cut.Good + ... + coeff.Ideal * cut.Ideal

FIX/WORKAROUND 1:

# Save the old ordered-categorical, then clobber it with the unordered one so that lm() Does The Right Thing (tm)
diamonds$cut.ordered <- diamonds$cut
diamonds$cut <- factor(diamonds$cut, ordered=F)

OR ELSE FIX 2:
# Make lm treat all categoricals as unordered categoricals, even ordered ones
#options('contrasts' = c('contr.treatment','contr.treatment') )

lm(log(price) ~ log(carat) + cut, data=diamonds)

Coefficients:
 (Intercept)    log(carat)       cutGood  cutVery Good    cutPremium  
      8.2001        1.6958        0.1632        0.2408        0.2382  
    cutIdeal  
      0.3172

edited May 14 '17 at 07:31

answered May 07 '17 at 22:04

smci

32,567
20
113
146

1

I tried `diamonds2$cut <- factor(diamonds2$cut)`, and `is.factor(diamonds2$cut)` returns TRUE. However, I still get the `cut.L`, `cut.Q`... stuff. I also tried `mod <- lm(lprice ~ lcarat + factor(cut), data = diamonds2)`, but that causes my model to be `factor(cut).L`, `factor(cut).Q`... Do you know what's going on? – Ben Rubin May 07 '17 at 22:49
Tip: always run `summary(mod)` after generating a model, and eyeball the results and output coefficients as a quick sanity-check. – smci May 07 '17 at 23:01
Will do. `summary(mod)` gives me the same `cut.L` and `factor(cut).L` results. – Ben Rubin May 07 '17 at 23:14
This is incorrect. If lm() were treating cut as an integer, there would be exactly one coefficient for it (as there is for carat), not four. L, Q, C, and ^4 refer to polynomial trends, yes, but on the categorical factor cut, not on an integer version of it. contr.poly on a factor produces orthogonal contrast codes, which is a completely fine way to enter a factor into a linear model. – Rose Hartman May 07 '17 at 23:36
@RoseHartman: it is correct. I already referenced that the quadratic, cubic, quartic coeffts were caused by `contr.poly` with n=4, and I referred the OP to the manpage `?contr.poly` – smci May 07 '17 at 23:39
We could debate whether polynomials on an ordered categorical which is not an interval (e.g. cut-levels of a diamond) are meaningless if the ordered values are arbitrary/subjective and the intervals are not quantifiable. R factor() does not have an argument to distinguish between ordinal and interval Clearly, the default contr.poly misled the book author, their proofreader and thousands of readers... – smci May 07 '17 at 23:44
1

@RoseHartman: I really don't appreciate your driveby downvote for an answer with a full diagnosis and two code-fixes which took me an hour to research. – smci May 07 '17 at 23:45
1

@smci I appreciate all the time you spent answering this. You gave me a much better understanding of how `lm` works. – Ben Rubin May 07 '17 at 23:49
@smci I certainly don't mean to suggest your answer has no value (lots of useful references, etc.), but the main thrust (the first sentence, bolded: "treated diamonds$cut as an integer variable instead of a factor") is incorrect, and I'm afraid misleading. An integer is not the same as an ordered factor in R. The fixes you provide change the way lm handles the contrasts for that variable, but it's treated as a factor in both cases. – Rose Hartman May 08 '17 at 01:02
@RoseHartman: 'This is incorrect' is pretty unreasonable blanket verdict to write on an answer which you 99% agree with, just quibble four words in one clause. I already edited my answer four times to try to please you. Like the OP (and most R users) I had never encountered a regression on an ordered categorical, so *"lm appears to be treating the var like an integer"* is slightly sloppy but correct, the vast majority of us do not expect to see fourth-order polynomials on our factor vars. (if we more correctly said *"like an ordered categorical"*, that wouldn't tell people anything would it) – smci May 14 '17 at 07:26
@smci I'm sorry to have upset you. I appreciate your edits, but want to clarify that I don't "99% agree" with your answer. In most cases, polynomial contrasts are a much more sensible way to treat ordered factors than traditional dummy coding (which is why R uses `contr.poly` as the default for such variables). Your answer is improved, but it still casts the behavior OP observed as a bug ("unwanted"), which is misleading. I'll undo my downvote since the most problematic parts have been corrected, but the general idea (`contr.poly` should be "fixed" to `contr.treatment`) is counterproductive. – Rose Hartman May 15 '17 at 17:08
@smci Just a quick response to an assertion in your comment: "'lm appears to be treating the var like an integer' is slightly sloppy but correct" --- this is a serious misunderstanding. An integer in a regression is estimated with one degree of freedom, whereas a factor with k levels is estimated with k-1 separate effects. That's a huge and very meaningful difference. Try running OP's model with `as.numeric(cut)` and you'll see the two models are apples and oranges. I'm pointing this out to be constructive; I hope you take it in that spirit. – Rose Hartman May 15 '17 at 17:20
@RoseHartman: first, the blanket statement 'This is incorrect' as a response to something, 99% of which is correct by you, is profoundly unhelpful, especially when inviting other people to downvote. Comments are required to be constructive. Second, I had already edited my answer (for the fifth time(!)) to rephrase *"treating cut like an integer"* to *"as an ordered categorical"*. As you're well aware, comments however can't be edited. So the answer is now perfectly fine. All you object to is one stale comment which you're well aware can't be edited. I had already run with `as.numeric(cut)`... – smci May 16 '17 at 06:45
...I had already run with `as.numeric(cut)` a week ago and I'm well aware of the difference between ordered categorical and integer. I even wrote as much to you back on May 7: *"I already referenced that the quadratic, cubic, quartic coeffts were caused by contr.poly with n=4, and I referred the OP to the manpage ?contr.poly"* – smci May 16 '17 at 06:51
...I had already solved the OP's actual question *"How to get cut treated as expected?"* waaay back on May 7 with two viable code workarounds, within 1.5 hrs of them posting it. That was the question being asked. And I even pointed them to `?contr.poly` The OP did not need lengthy detail on contrast treatment of ordered categoricals, and giving them one doesn't answer how to actually change `options('contrasts')` to `contr.treatment`; my answer did. Yes the title says "Understanding..." but the detail of questions 1),2),3) clearly show they want to get unordered-categorical contrast behavior – smci May 16 '17 at 06:59
You keep suggesting that my comments have been nitpicking and that your answer is largely correct in my eyes - I've been trying to tell you it's not. You point out that you haven't used poly contrasts before, so I've been trying to provide specific, detailed responses to the gaps I see in your understanding - as I would want someone to do for me if I was writing about a topic that was new to me. I think it's resulted in improvements in your answer, which is the point, after all. [To learn more](http://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/) – Rose Hartman May 17 '17 at 15:31

score 4 · Answer 2 · answered May 08 '17 at 00:58

TL;DR

diamonds$cut has five possible values (fair, good, very good, premium, ideal), so why does the model only show four values for cut?

Factors are usually represented with one fewer coefficients than there are levels in the factor. That's because you're also estimating an intercept for the model. The information you would have gotten for the other level of your factor is instead represented in the intercept.

From my understanding, R treats a categorical variable as being either 1 or 0 in the linear regression equation, so each "cut" coefficient will either be multiplied by 1 or 0 when evaluating a data row. Is that correct?

That is not always the case. That is true for traditional dummy coding (contr.treatment), but there are plenty of other ways to enter factors into a model instead. In the model you presented, you have orthogonal polynomial contrast codes.

3) How do I write a y = a_0 + (a_1 * x_1) + (a_2 * x_2)... from that coefficients given above? Is that possible in this case?

It is not impossible, but it is more difficult (see details below). The polynomial contrast variables can't always be neatly represented in single-group comparisons because they represent overall trends across the levels, so they're harder to think about in terms of regression equations. A decent approximation would be:

lprice = 12.10711 + 1.69577*lcarat + 0.32364*Lin_cut + -0.09583*Qua_cut + 0.07631*Cub_cut + 0.02688*4_cut + error

Where Lin_cut is the linear trend in cut, Qua_cut is the quadratic trend in cut, Cub_cut is the cubic trend in cut, and 4_cut is the 4^ trend in cut.

Longer explanation

cut is an ordered factor, meaning that it's categorical, but it represents some underlying continuous variable, so the order of the levels matters. Note the difference in how R describes cut compared to another factor:

> str(diamonds$cut)
 Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
> str(iris$Species)
 Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Because ordered factors are often analyzed and interpreted a little differently from other factors, R's defaults treat them differently when entered in a lm() model:

> options("contrasts")
$contrasts
        unordered           ordered 
"contr.treatment"      "contr.poly"

To enter a factor with k levels as a predictor in lm(), you need to convert it to k-1 codes instead. There are several ways to do this, all of them fine options; the difference is that they change the interpretation of the coefficients you get from the model, so depending on what kinds of questions you want to answer you'll want to choose one strategy of coding your categorical variables over another.

`contr.treatment`

contr.treatment creates what are sometimes called "traditional" dummy codes. One level (the first level of the factor, by default) is treated as the reference group, and then each code represents the difference between that reference group and each other level.

> lm(Petal.Width ~ Species, data = iris)

Call:
lm(formula = Petal.Width ~ Species, data = iris)

Coefficients:
      (Intercept)  Speciesversicolor   Speciesvirginica  
            0.246              1.080              1.780  
> levels(iris$Species)
[1] "setosa"     "versicolor" "virginica"

In this example, the mean of Petal.Width is 0.246 in the reference group (setosa), 0.246 + 1.080 = 1.36 for versicolor, and 0.246 + 1.780 = 2.026 for virginica.

> library(dplyr)
> iris %>% group_by(Species) %>% summarize(Petal.Width = mean(Petal.Width))
# A tibble: 3 × 2
     Species Petal.Width
      <fctr>       <dbl>
1     setosa       0.246
2 versicolor       1.326
3  virginica       2.026

R does the dummy coding for you automatically in the background, but you can always check it with:

> mod$contrasts
$Species
[1] "contr.treatment"

This is what those dummy coded variables would look like:

> contr.treatment(levels(iris$Species))
           versicolor virginica
setosa              0         0
versicolor          1         0
virginica           0         1

There are two dummy codes created (the two columns here), since there are three levels in the factor. For each case in the dataset where the species is setosa, both dummy codes would be 0. When the species is versicolor, the versicolor dummy is 1, and the virginica dummy is 0. When the species is virginica, that dummy is 1 and the other is 0.

`contr.poly`

While you certainly can represent any categorical variable with traditional dummy codes, it's not always the most informative way to do so. The resulting coefficients test each level against the reference level, which may not be of any particular interest in your data. If you do ?contr.treatment in R, you'll see several handy options, although you can also write your own codes from scratch if the built-in ones don't meet your needs.

For ordered factors, R assumes polynomial trend contrasts will be the most useful in most cases, which is why it's the default. You can see how it works with this:

> contr.poly(levels(diamonds$cut))
             .L         .Q            .C         ^4
[1,] -0.6324555  0.5345225 -3.162278e-01  0.1195229
[2,] -0.3162278 -0.2672612  6.324555e-01 -0.4780914
[3,]  0.0000000 -0.5345225 -4.095972e-16  0.7171372
[4,]  0.3162278 -0.2672612 -6.324555e-01 -0.4780914
[5,]  0.6324555  0.5345225  3.162278e-01  0.1195229

These codes are not as straight-forward to interpret as the contr.treatment codes, but plotting may help:

library(tidyr)
library(ggplot2)

contr.poly(levels(diamonds$cut)) %>% 
as.data.frame() %>% 
mutate(level=1:nrow(codes)) %>% 
gather("key", "value", -level) %>% 
ggplot(aes(x=level, y=value,color = key)) + 
geom_line()

This makes it a little clearer that the codes for the linear trend (L) form a straight line, whereas the codes for the quadratic trend form a U-shape, the codes for the cubic trend form a sort of tilted N-shape, and the ^4 trend form a U with a spike in the middle. The contrast codes can be interpreted to mean each of those trends, so the L code is interpreted as the linear trend in the data, the Q code is the quadratic trend in the data, etc.

Each case in the data gets values for each of these four contrast codes, and those contrast code variables are what get used to estimate the model. For example, for a variable with cut="Fair", the values would be -0.632 for the linear contrast code variable, 0.534 for the quadratic, -.316 for the cubic, and 0.119 for the ^4.

For your model, you end up with a positive linear trend for cut (the coefficient for cut.L is positive, and significantly different from zero, which you can see by running summary(mod)). This means that the better the cut, controlling for log(carats), the higher the log(price): Ideal is higher price than Premium, which is higher price than Very Good, etc. You also see a negative quadratic trend, though, which indicates an upside-down U shape. That suggests that the middle-quality cuts are higher log(price) than would be expected from the linear trend --- a positive linear plus a negative quadratic. The positive cubic suggests that there's some drop in log(price) from Good to Premium, or at least less of an increase than would be expected from the linear and quadratic trends. The ^4 trend suggests that the log(price) for Very Good cut is higher relative to Good and Premium than would be expected.

Understanding why linear regression isn't treating my categorical variable as expected?

2 Answers2

TL;DR

Longer explanation

`contr.treatment`

`contr.poly`

Further reading