I learnt how to use R to perform an F-test for lack of fit of a regression model, where $H_0$: "there is no lack of fit in the regression model". The test statistic is
$$F = \frac{SSLF/df_1}{SSPE/df_2},$$
where $df_1$ is the degrees of freedom for SSLF (the lack-of-fit sum of squares) and $df_2$ is the degrees of freedom for SSPE (the sum of squares due to pure error).
In R, the F-test (say for a model with 2 predictors) can be carried out with

    anova(lm(y ~ x1 + x2), lm(y ~ factor(x1) * factor(x2)))
Example output:

    Model 1: y ~ x1 + x2
    Model 2: y ~ factor(x1) * factor(x2)
      Res.Df    RSS Df Sum of Sq      F Pr(>F)
    1     19 18.122
    2     11 12.456  8    5.6658 0.6254 0.7419
The F-statistic is 0.6254, with a p-value of 0.7419. Since the p-value is greater than 0.05, we fail to reject $H_0$ that there is no lack of fit, so the model appears adequate.
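For concreteness, here is a minimal sketch with simulated data (all variable names and values are hypothetical, not from the example above) showing the idea behind the second model: `factor(x1) * factor(x2)` fits one mean per distinct (x1, x2) combination, so its residual sum of squares is exactly the pure-error sum of squares computed from replicate deviations about the cell means.

```r
set.seed(1)
# Replicated design: two observations at each of six (x1, x2) cells
x1 <- rep(c(1, 2, 3), each = 4)
x2 <- rep(c(0, 1), times = 6)
y  <- 2 + x1 + x2 + rnorm(length(x1), sd = 0.5)

# Saturated model: one fitted mean per distinct (x1, x2) cell
full <- lm(y ~ factor(x1) * factor(x2))

# Pure-error SS computed by hand: squared deviations of replicates
# from their own cell mean, summed over cells
cell <- interaction(x1, x2)
sspe <- sum(tapply(y, cell, function(v) sum((v - mean(v))^2)))

all.equal(sum(resid(full)^2), sspe)  # TRUE
```

So the RSS of the saturated factor model is the SSPE, and `anova()` compares it with the regression model's RSS to isolate the lack-of-fit component.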
What I want to know is: why fit two models, and why use the command `factor(x1) * factor(x2)` in the second one? Apparently the RSS of Model 2, 12.456, is magically the SSPE for Model 1. Why?