TL;DR
diamonds$cut has five possible values (fair, good, very good, premium, ideal), so why does the model only show four values for cut?
Factors are usually represented with one fewer coefficient than there are levels in the factor. That's because the model also estimates an intercept, and the information you would have gotten from a coefficient for the remaining level is instead represented in the intercept.
From my understanding, R treats a categorical variable as being either 1 or 0 in the linear regression equation, so each "cut" coefficient will either be multiplied by 1 or 0 when evaluating a data row. Is that correct?
That is not always the case. It is true for traditional dummy coding (contr.treatment), but there are plenty of other ways to enter factors into a model. The model you presented uses orthogonal polynomial contrast codes.
3) How do I write a y = a_0 + (a_1 * x_1) + (a_2 * x_2)... from that coefficients given above? Is that possible in this case?
It is not impossible, but it is more difficult (see details below). The polynomial contrast variables can't always be neatly represented in single-group comparisons because they represent overall trends across the levels, so they're harder to think about in terms of regression equations. A decent approximation would be:
lprice = 12.10711 + 1.69577*lcarat + 0.32364*Lin_cut - 0.09583*Qua_cut + 0.07631*Cub_cut + 0.02688*4_cut + error
Where Lin_cut is the linear trend in cut, Qua_cut is the quadratic trend in cut, Cub_cut is the cubic trend in cut, and 4_cut is the quartic (^4) trend in cut.
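To make the equation more concrete, the four cut terms can be collapsed into a single offset per level by multiplying the contrast matrix by the cut coefficients quoted above. This is only a sketch in base R: the level names are typed in by hand (assumed to match levels(diamonds$cut)), and cut_coefs and cut_offset are names made up for this illustration.

```r
# Contrast codes for a five-level ordered factor (what contr.poly produces)
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
codes <- contr.poly(5)
rownames(codes) <- cut_levels

# Coefficients for the linear, quadratic, cubic and ^4 terms,
# copied from the equation above
cut_coefs <- c(0.32364, -0.09583, 0.07631, 0.02688)

# Each level's total contribution to lprice, holding lcarat fixed
cut_offset <- drop(codes %*% cut_coefs)
round(cut_offset, 3)
```

The predicted lprice for a given level is then 12.10711 + 1.69577*lcarat plus that level's offset. Because each contrast column sums to zero, the offsets also sum to zero across the five levels.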
Longer explanation
cut is an ordered factor, meaning that it's categorical, but it represents some underlying continuous variable, so the order of the levels matters. Note the difference in how R describes cut compared to another factor:
> str(diamonds$cut)
Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
> str(iris$Species)
Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Because ordered factors are often analyzed and interpreted a little differently from other factors, R's defaults treat them differently when entered in an lm() model:
> options("contrasts")
$contrasts
unordered ordered
"contr.treatment" "contr.poly"
To enter a factor with k levels as a predictor in lm(), you need to convert it to k-1 codes instead. There are several ways to do this, all of them valid; they differ in how the resulting coefficients are interpreted, so choose the coding strategy that matches the kinds of questions you want to answer.
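One way to see that the coding changes the coefficients but not the model itself is to fit the same data twice. The sketch below uses the built-in iris data and the contrasts argument of lm(); contr.helmert is picked only as a second built-in coding for comparison.

```r
# Same model, two different codings of the three-level factor Species
fit_trt <- lm(Petal.Width ~ Species, data = iris,
              contrasts = list(Species = "contr.treatment"))
fit_hel <- lm(Petal.Width ~ Species, data = iris,
              contrasts = list(Species = "contr.helmert"))

# k = 3 levels become k - 1 = 2 coefficients, plus the intercept
length(coef(fit_trt))
length(coef(fit_hel))

# The coefficients differ, but the fitted values are the same
coef(fit_trt)
coef(fit_hel)
all.equal(fitted(fit_trt), fitted(fit_hel))
```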
contr.treatment
contr.treatment creates what are sometimes called "traditional" dummy codes. One level (the first level of the factor, by default) is treated as the reference group, and then each code represents the difference between that reference group and each other level.
> lm(Petal.Width ~ Species, data = iris)
Call:
lm(formula = Petal.Width ~ Species, data = iris)
Coefficients:
(Intercept) Speciesversicolor Speciesvirginica
0.246 1.080 1.780
> levels(iris$Species)
[1] "setosa" "versicolor" "virginica"
In this example, the mean of Petal.Width is 0.246 in the reference group (setosa), 0.246 + 1.080 = 1.326 for versicolor, and 0.246 + 1.780 = 2.026 for virginica.
> library(dplyr)
> iris %>% group_by(Species) %>% summarize(Petal.Width = mean(Petal.Width))
# A tibble: 3 × 2
Species Petal.Width
<fctr> <dbl>
1 setosa 0.246
2 versicolor 1.326
3 virginica 2.026
R does the dummy coding for you automatically in the background, but you can always check it with:
> lm(Petal.Width ~ Species, data = iris)$contrasts
$Species
[1] "contr.treatment"
This is what those dummy coded variables would look like:
> contr.treatment(levels(iris$Species))
versicolor virginica
setosa 0 0
versicolor 1 0
virginica 0 1
There are two dummy codes created (the two columns here), since there are three levels in the factor. For each case in the dataset where the species is setosa, both dummy codes would be 0. When the species is versicolor, the versicolor dummy is 1, and the virginica dummy is 0. When the species is virginica, that dummy is 1 and the other is 0.
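To check this against actual rows, you can look at the design matrix that lm() builds. A small sketch using base R's model.matrix() on iris (X is just a name chosen here):

```r
# Design matrix for Petal.Width ~ Species under the default treatment coding
X <- model.matrix(~ Species, data = iris)

# Rows 1, 51 and 101 are the first setosa, versicolor and virginica
# observations in iris, so this shows one row per species
X[c(1, 51, 101), ]
```

Each row has a 1 in the intercept column, and the two Species columns follow the pattern of the contrast matrix above.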
contr.poly
While you certainly can represent any categorical variable with traditional dummy codes, it's not always the most informative way to do so. The resulting coefficients test each level against the reference level, which may not be of any particular interest in your data. If you do ?contr.treatment in R, you'll see several handy options, although you can also write your own codes from scratch if the built-in ones don't meet your needs.
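As one example of an alternative, sum-to-zero contrasts (contr.sum, one of the built-ins) compare levels with the mean of the level means rather than with a reference level. A minimal sketch on iris, where species_sum is a copy of the factor made just for this illustration:

```r
# Copy the factor and attach sum-to-zero contrasts to the copy
species_sum <- iris$Species
contrasts(species_sum) <- contr.sum(nlevels(species_sum))

fit_sum <- lm(iris$Petal.Width ~ species_sum)
coef(fit_sum)

# Under this coding the intercept is the mean of the three species means
group_means <- tapply(iris$Petal.Width, iris$Species, mean)
mean(group_means)
```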
For ordered factors, R assumes polynomial trend contrasts will be the most useful in most cases, which is why they are the default. You can see the codes with this:
> contr.poly(levels(diamonds$cut))
.L .Q .C ^4
[1,] -0.6324555 0.5345225 -3.162278e-01 0.1195229
[2,] -0.3162278 -0.2672612 6.324555e-01 -0.4780914
[3,] 0.0000000 -0.5345225 -4.095972e-16 0.7171372
[4,] 0.3162278 -0.2672612 -6.324555e-01 -0.4780914
[5,] 0.6324555 0.5345225 3.162278e-01 0.1195229
These codes are not as straightforward to interpret as the contr.treatment codes, but plotting may help:
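library(dplyr)    # for %>% and mutate()
library(tidyr)    # for gather()
library(ggplot2)  # for plotting and the diamonds data
contr.poly(levels(diamonds$cut)) %>%
  as.data.frame() %>%
  mutate(level = row_number()) %>%   # 1 = Fair, ..., 5 = Ideal
  gather("key", "value", -level) %>%
  ggplot(aes(x = level, y = value, color = key)) +
  geom_line()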

This makes it a little clearer that the codes for the linear trend (.L) form a straight line, whereas the codes for the quadratic trend (.Q) form a U-shape, the codes for the cubic trend (.C) form a sort of tilted N-shape, and the codes for the ^4 trend form a U with a spike in the middle. Each contrast code is interpreted as the corresponding trend in the data: the .L code is the linear trend, the .Q code is the quadratic trend, and so on.
Each case in the data gets values for each of these four contrast codes, and those contrast code variables are what get used to estimate the model. For example, for a case with cut = "Fair", the values would be -0.632 for the linear contrast code variable, 0.535 for the quadratic, -0.316 for the cubic, and 0.120 for the ^4.
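You can see those per-case values directly with model.matrix(). The sketch below builds a three-row toy ordered factor with the same level names as diamonds$cut (typed in by hand, so it runs without loading ggplot2); toy_cut is a made-up name.

```r
# A toy ordered factor with the same five levels as diamonds$cut
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
toy_cut <- factor(c("Fair", "Ideal", "Very Good"),
                  levels = cut_levels, ordered = TRUE)

# Expand each case into the four contrast-code variables (plus the intercept)
X <- model.matrix(~ toy_cut, contrasts.arg = list(toy_cut = "contr.poly"))
round(X, 3)

# The four code columns are orthonormal: their cross-product is the identity
round(crossprod(contr.poly(5)), 10)
```

The first row (the "Fair" case) reproduces the first row of the contr.poly matrix shown earlier.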
For your model, you end up with a positive linear trend for cut (the coefficient for cut.L is positive, and significantly different from zero, which you can see by running summary(mod)). This means that the better the cut, controlling for log(carat), the higher the log(price): broadly, Ideal is priced higher than Premium, which is priced higher than Very Good, and so on. You also see a negative quadratic trend, though, which indicates an upside-down U shape. That suggests that the middle-quality cuts have a higher log(price) than would be expected from the linear trend alone, since the pattern combines a positive linear trend with a negative quadratic one. The positive cubic suggests that there's some drop in log(price) from Good to Premium, or at least less of an increase than would be expected from the linear and quadratic trends. The ^4 trend suggests that the log(price) for Very Good cut is higher relative to Good and Premium than would be expected.
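If level-by-level comparisons are easier to reason about than trends, the same kind of model can be refit with treatment contrasts for the ordered factor. The sketch below uses simulated data, not the diamonds data: sim, quality, x, y and the effect sizes are all invented for illustration, and the point is only that the parameterization changes while the fit does not.

```r
set.seed(1)
lv <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
sim <- data.frame(
  quality = factor(sample(lv, 200, replace = TRUE), levels = lv, ordered = TRUE),
  x = rnorm(200)
)
# Invented data-generating process: a slope on x plus a step per quality level
sim$y <- 2 * sim$x + 0.1 * as.integer(sim$quality) + rnorm(200, sd = 0.5)

# Polynomial trend coding (R's usual default for ordered factors), set explicitly
fit_poly <- lm(y ~ x + quality, data = sim,
               contrasts = list(quality = "contr.poly"))
# Treatment coding: each level compared with the first level
fit_trt <- lm(y ~ x + quality, data = sim,
              contrasts = list(quality = "contr.treatment"))

names(coef(fit_poly))  # trend terms such as quality.L and quality.Q
names(coef(fit_trt))   # one term per non-reference level
all.equal(fitted(fit_poly), fitted(fit_trt))
```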
Further reading
For a much more in-depth explanation of polynomial trend contrasts, see this excellent answer on Cross Validated.