In R, I fitted a polynomial regression to the data set below. The R² is good, and the p-values for both the coefficients and the overall model are less than 0.05. However, shapiro.test on the residuals gives a p-value of 0.01088, which suggests that the residuals are not normally distributed. So I wonder whether the polynomial regression is valid. Do the residuals of a polynomial regression have to satisfy the normality assumption?

Attached below are the code and the data used for regression.

alloy<-data.frame(
  x=c(37.0, 37.5, 38.0, 38.5, 39.0, 39.5, 40.0,
      40.5, 41.0, 41.5, 42.0, 42.5, 43.0),
  y=c(3.40, 3.00, 3.00, 3.27, 2.10, 1.83, 1.53,
      1.70, 1.80, 1.90, 2.35, 2.54, 2.90))

lm.sol=lm(y~x+I(x^2),data=alloy)
summary(lm.sol)

y.res=lm.sol$residuals
shapiro.test(y.res)
Marco Sandri
mcxmcx

1 Answer

Well ... this question probably belongs on Cross Validated (stats.stackexchange.com), since it has little to do with programming. However, here's my brief take on your data.

R² and shapiro.test address different features of the data and the model fit, so one can be "good" while the other is not (for sufficiently vague definitions of "good" and "not").
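To make the distinction concrete, here is a minimal sketch that pulls both quantities from the same fit. The data and model formula are copied from the question; the names fit, r2 and sw are introduced here only for illustration.

```r
# Data and model copied from the question
alloy <- data.frame(
  x = c(37.0, 37.5, 38.0, 38.5, 39.0, 39.5, 40.0,
        40.5, 41.0, 41.5, 42.0, 42.5, 43.0),
  y = c(3.40, 3.00, 3.00, 3.27, 2.10, 1.83, 1.53,
        1.70, 1.80, 1.90, 2.35, 2.54, 2.90))
fit <- lm(y ~ x + I(x^2), data = alloy)

# R-squared: the share of the variance in y explained by the fitted curve
r2 <- summary(fit)$r.squared

# Shapiro-Wilk: tests whether the residuals look like a normal sample
sw <- shapiro.test(resid(fit))

cat("R-squared:", round(r2, 4), "\n")
cat("Shapiro-Wilk p-value:", round(sw$p.value, 5), "\n")
```

The first number describes how closely the curve follows the points; the second describes the shape of what is left over, so neither one implies the other.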

If you plot your data and your fit in the same graph then you see that the overall trend is nicely captured by your quadratic regression model.

plot(y ~ x, data=alloy)
lines(alloy$x, predict(lm.sol))

[Figure: scatter plot of y against x with the fitted quadratic curve]

The model does quite nicely. You can also see that the qq-plot of the residuals indicates that there might be a problem with variance homogeneity (see the last residual).

qqnorm(resid(lm.sol))

[Figure: normal Q-Q plot of the residuals]

In other words, the residuals may not necessarily follow a Gaussian distribution, but the overall trend in the data is captured.
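If you want to look at the residuals a bit more closely, a sketch along these lines may help. It uses only base R; the data and model are copied from the question, and the names fit, res and worst are just illustrative. A residuals-versus-fitted plot is one common way to eyeball variance homogeneity, and qqline adds a reference line to the qq-plot.

```r
# Data and model copied from the question
alloy <- data.frame(
  x = c(37.0, 37.5, 38.0, 38.5, 39.0, 39.5, 40.0,
        40.5, 41.0, 41.5, 42.0, 42.5, 43.0),
  y = c(3.40, 3.00, 3.00, 3.27, 2.10, 1.83, 1.53,
        1.70, 1.80, 1.90, 2.35, 2.54, 2.90))
fit <- lm(y ~ x + I(x^2), data = alloy)
res <- resid(fit)

# Residuals against fitted values: a funnel shape or a single stray
# point hints at unequal variance or an outlier
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal qq-plot with a reference line through the quartiles
qqnorm(res)
qqline(res)

# Row index of the observation with the largest absolute residual
worst <- unname(which.max(abs(res)))
cat("Largest absolute residual at row", worst, "\n")
```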

Did that help?

ekstroem
  • Great answer! Does it mean that even when the p-value of shapiro.test for the residuals is less than 0.05, the model can still be regarded as a successful one if its R² is high and it passes the t-tests for the coefficients and the F-test for the model? – mcxmcx May 29 '17 at 21:11
  • Also, in the original model, point 4 has the largest residual. So I deleted this point and refitted the same polynomial regression. With the new fit, the R² increases to 0.9402 and the model passes the t-tests and the F-test. What's more, the p-value of shapiro.test for the residuals is now greater than 0.05. Based on your suggestion, which model should I use in future research (the original model or the new one)? – mcxmcx May 29 '17 at 21:19
  • To answer your last question first: I wouldn't delete points willy-nilly without having a good reason for it. If the model doesn't fit well then that is a problem with the model - not the data. In other words - I'd rather use the first model on the original data than the other one. I doubt the fitted curves will be much different. – ekstroem May 29 '17 at 22:01
  • A model can be a good model (and even a correct model) even with a low R² - see this [great post](https://stats.stackexchange.com/questions/13314/is-r2-useful-or-dangerous). A model can be "good" or "relevant" if it provides a reasonable abstraction of a process - it all depends on what you want to use the model for. But that question really belongs on Cross Validated (stats.stackexchange.com) – ekstroem May 29 '17 at 22:09
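The comparison discussed in the comments (the original fit versus the fit without point 4) can be checked directly. The following is only a sketch: the data and formula come from the question, while fit_all, fit_drop, grid and max_gap are names made up here for illustration.

```r
# Data copied from the question
alloy <- data.frame(
  x = c(37.0, 37.5, 38.0, 38.5, 39.0, 39.5, 40.0,
        40.5, 41.0, 41.5, 42.0, 42.5, 43.0),
  y = c(3.40, 3.00, 3.00, 3.27, 2.10, 1.83, 1.53,
        1.70, 1.80, 1.90, 2.35, 2.54, 2.90))

# Same quadratic model on all rows and with row 4 left out
fit_all  <- lm(y ~ x + I(x^2), data = alloy)
fit_drop <- lm(y ~ x + I(x^2), data = alloy[-4, ])

# Evaluate both curves on a common grid over the observed x range
grid <- data.frame(x = seq(37, 43, by = 0.1))
pred_all  <- predict(fit_all,  newdata = grid)
pred_drop <- predict(fit_drop, newdata = grid)

# Largest vertical distance between the two fitted curves
max_gap <- max(abs(pred_all - pred_drop))
cat("Largest gap between the two curves:", round(max_gap, 3), "\n")

# Overlay both curves on the data (dashed = fit without row 4)
plot(y ~ x, data = alloy)
lines(grid$x, pred_all)
lines(grid$x, pred_drop, lty = 2)
```

If the two curves are close over the whole range, dropping the point changes the diagnostics more than it changes the fitted trend.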