A colleague and I noticed this interesting quark in the lm function in R.
Say, I am regressing y variable on an x variable and x variable is a factor level variable with two categories (0/1).
When I run the regression and examine the fitted values, there should be two fitted values. One for the intercept and one fitted value when beta = 1.
Instead, there are more than two. Three nearly identical fitted values for the intercept and one fitted value when beta = 1.
Among those that are different, the difference occurs at the last decimal point.
What might be occurring within R that produces this quark? Why are the intercept's fitted values nearly identical but not perfectly identical?
set.seed(1995)
x <- sample(c(0,1), 100, replace = T, prob = c(.5,.5))
y <- runif(100, min = 1, max = 100)
df <- data.frame(x, y)
OLS <- lm(y ~ as.factor(x), data = df)
summary(OLS)
Call:
lm(formula = y ~ as.factor(x), data = df)
Residuals:
Min 1Q Median 3Q Max
-52.374 -25.163 1.776 25.521 46.571
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 54.503 4.176 13.05 <0.0000000000000002 ***
as.factor(x)1 -5.117 5.683 -0.90 0.37
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 28.33 on 98 degrees of freedom
Multiple R-squared: 0.008205, Adjusted R-squared: -0.001916
F-statistic: 0.8107 on 1 and 98 DF, p-value: 0.3701
table(OLS$fitted.values)
49.385426930928 54.5027935733593 54.5027935733594 54.5027935733595
54 32 13 1
My hunch is that this is the product of numerical errors as outlined in the first circle of Burn's (2011) R Inferno?