
I learnt how to use R to perform an F-test for a lack of fit of a regression model, where $H_0$: "there is no lack of fit in the regression model".

$$F_{LOF} = \frac{MSLF}{MSPE} = \frac{SSLF(\text{model}) / df_1}{SSPE/df_2}$$

where $df_1$ is the degrees of freedom for SSLF (lack-of-fit sum of squares) and $df_2$ is the degrees of freedom for SSPE (sum of squares due to pure error).

In R, the F-test (say for a model with 2 predictors) can be calculated with

anova(lm(y ~ x1 + x2), lm(y ~ factor(x1)*factor(x2)))

Example output:

Model 1: y ~ x1 + x2
Model 2: y ~ factor(x1) * factor(x2)
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     19 18.122                           
2     11 12.456  8    5.6658 0.6254 0.7419

F-statistic: 0.6254 with a p-value of 0.7419.

Since the p-value is greater than 0.05, we do not reject $H_0$ that there is no lack of fit. Therefore the model is adequate.
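For reference, here is a self-contained simulated example (not my actual data) that produces output of this shape; the predictor values are chosen so that each (x1, x2) combination is replicated, which is what makes a pure-error estimate possible:

```r
# Simulated data with replicated predictor combinations (hypothetical example)
set.seed(1)
x1 <- rep(1:3, each = 4)         # each (x1, x2) cell occurs twice
x2 <- rep(1:2, times = 6)
y  <- 2 + 0.5 * x1 - 0.3 * x2 + rnorm(12, sd = 0.2)

reduced <- lm(y ~ x1 + x2)                  # proposed additive regression model
full    <- lm(y ~ factor(x1) * factor(x2))  # saturated cell-means model
anova(reduced, full)                        # lack-of-fit F-test
```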

What I want to know is: why use two models, and why use the command factor(x1)*factor(x2)? Apparently 12.456, the RSS from Model 2, is magically the SSPE for Model 1.

Why?

hongsy
  • If you have a question about statistical tests, you should ask over at [stats.se]; Stack Overflow site is for specific programming questions. – MrFlick Jul 28 '17 at 04:21
  • thank you for the advice, I have been granted permission by @daniel-fischer [here](https://math.stackexchange.com/questions/2140676/using-r-for-lack-of-fit-test) to re-ask, since the original did not gather any answer nor comment. – hongsy Jul 28 '17 at 05:39
  • the related question is [here](https://stats.stackexchange.com/questions/287532/r-lack-of-fit-test-via-anova) – hongsy Jul 28 '17 at 05:40

1 Answer


You are testing whether a model with an interaction improves the model fit.

Model 1 corresponds to an additive effect of x1 and x2.

One way to "check" whether the complexity of a model is adequate (in your case, whether a multiple regression with additive effects makes sense for your data) is to compare the proposed model with a more flexible/complex model.

Your model 2 has the role of this more flexible model. First the predictors are made categorical (by using factor(x1) and factor(x2)) and then an interaction between them is constructed by factor(x1)*factor(x2). The interaction model includes the additive model as a special case (i.e., model 1 is nested in model 2) and has several extra parameters to provide a potentially better fit to the data.

You can see the difference in the number of parameters between the two models in the output from anova. Model 2 has 8 extra parameters to allow for a better fit, but because the p-value is non-significant you would conclude that model 2 (with the extra flexibility from those 8 additional parameters) does not actually provide a significantly better fit to the data. Thus, the additive model provides a decent enough fit to the data when compared to model 2.
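In fact, the F-statistic in the question's output can be reproduced by hand from the printed sums of squares, which shows why the RSS of model 2 plays the role of SSPE:

```r
# Numbers taken from the anova output in the question
SSPE <- 12.456                 # RSS of model 2 = pure-error sum of squares
SSLF <- 18.122 - 12.456        # lack-of-fit SS = difference in RSS (5.666)

F_LOF <- (SSLF / 8) / (SSPE / 11)      # df: 8 for lack of fit, 11 for pure error
F_LOF                                  # approximately 0.6254, matching anova
pf(F_LOF, 8, 11, lower.tail = FALSE)   # approximately 0.7419, matching anova
```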

Note that the trick above of making categories (factors) of x1 and x2 only really works when the number of unique values of x1 and x2 is low. If x1 and x2 are numeric and each individual has their own value, then model 2 is not that useful because you end up with as many parameters as observations. In those situations more ad hoc modifications, such as binning the variables, are used.
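A sketch of the binning idea, on hypothetical data with a continuous predictor (the number of breaks is an arbitrary, ad hoc choice). Adding the bin indicators on top of the linear term keeps the linear model nested inside the more flexible one, so the anova comparison remains valid:

```r
# Hypothetical continuous predictor with no repeated values
set.seed(1)
x1 <- runif(30)
y  <- 1 + 2 * x1 + rnorm(30, sd = 0.1)

x1_binned <- cut(x1, breaks = 4)       # discretize into 4 intervals (ad hoc choice)
flexible  <- lm(y ~ x1 + x1_binned)    # linear trend plus per-bin offsets
anova(lm(y ~ x1), flexible)            # tests whether the bin offsets improve the fit
```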

ekstroem
  • WOW!! Thank you for answering a question which has eluded math.stackexchange and stats.stackexchange for months!! Is it possible to "figure out" what the 8 extra parameters in Model 2 are? – hongsy Jul 28 '17 at 09:30
  • 1
    Yes that is possible but it depends on the data (the number of levels of `x1` and `x2`). Your model 1 has 3 mean parameters (intercept, beta_x1, beta_x2). If you type `summary(model2)` you should see the parameters that are present in that model so you can get a gist from there – ekstroem Jul 28 '17 at 09:36