0

A colleague and I noticed this interesting quark in the lm function in R.

Say, I am regressing y variable on an x variable and x variable is a factor level variable with two categories (0/1).

When I run the regression and examine the fitted values, there should be two fitted values. One for the intercept and one fitted value when beta = 1.

Instead, there are more than two. Three nearly identical fitted values for the intercept and one fitted value when beta = 1.

Among those that are different, the difference occurs at the last decimal point.

What might be occurring within R that produces this quark? Why are the intercept's fitted values nearly identical but not perfectly identical?

set.seed(1995)
x <- sample(c(0,1), 100, replace = T, prob = c(.5,.5))
y <- runif(100, min = 1, max = 100)

df <- data.frame(x, y)

OLS <- lm(y ~ as.factor(x), data = df)
summary(OLS)

Call:
lm(formula = y ~ as.factor(x), data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-52.374 -25.163   1.776  25.521  46.571 

Coefficients:
              Estimate Std. Error t value            Pr(>|t|)    
(Intercept)     54.503      4.176   13.05 <0.0000000000000002 ***
as.factor(x)1   -5.117      5.683   -0.90                0.37    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 28.33 on 98 degrees of freedom
Multiple R-squared:  0.008205,  Adjusted R-squared:  -0.001916 
F-statistic: 0.8107 on 1 and 98 DF,  p-value: 0.3701

table(OLS$fitted.values)

 49.385426930928 54.5027935733593 54.5027935733594 54.5027935733595 
              54               32               13                1 

My hunch is that this is the product of numerical errors as outlined in the first circle of Burn's (2011) R Inferno?

Sharif Amlani
  • 1,138
  • 1
  • 11
  • 25
  • 1
    It is not 100% clear what kind of an answer you are looking for. Are you asking why this is happening? Or how to properly fit a linear model to categorical variables? – Artem Sokolov Jul 22 '20 at 21:47
  • Why is this happening? Why are the fitted values are essentially identical but not perfectly identical? – Sharif Amlani Jul 22 '20 at 21:48
  • 1
    You could dig through the machinery of `lm.fit` to find out. The short answer is yes, this is due to floating-point inaccuracies in the computation. In order to find out *exactly* what's going on you'd presumably have to look at the precise steps in the QR decomposition (the core of the numerical operation is `z <- .Call(C_Cdqrls, x, y, tol, FALSE)`) – Ben Bolker Jul 22 '20 at 22:01
  • Great! Thanks so much for the feedback! – Sharif Amlani Jul 22 '20 at 22:05

0 Answers0