I've been self-studying Discovering Statistics Using R by Andy Field and have come across this passage:
Data splitting: This approach involves randomly splitting your data set, computing a regression equation on both halves of the data and then comparing the resulting models. When using stepwise methods, cross-validation is a good idea; you should run the stepwise regression on a random selection of about 80% of your cases. Then force this model on the remaining 20% of the data. By comparing values of the R2 and b-values in the two samples you can tell how well the original model generalizes (see Tabachnick & Fidell, 2007, for more detail).
Alright, I understand subsetting my data (using sample()), and I know how to fit linear models (using lm()), but the line "Then force this model on the remaining 20% of the data" confuses me.
This technique is never brought up again in the book. Is there some function in R that lets you force a model onto data and then computes R^2 and b-values using that forced model? Perhaps some function where you input the intercept and slope coefficients and it outputs something like what summary(lm) does?
Or am I not understanding what this passage is trying to say?
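My best guess so far is that "forcing" just means taking the coefficients estimated on the 80% sample and applying them to the held-out 20% via predict(), then computing R^2 on those predictions by hand. Here is a minimal sketch of that guess using made-up toy data (the variables x and y and the simulated data set are my own invention, just to have something runnable):

```r
set.seed(123)

# Toy data standing in for my real data set
n <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- 2 + 3 * dat$x + rnorm(n)

# Randomly select about 80% of cases for the first sample
train_idx <- sample(seq_len(n), size = round(0.8 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Fit the regression on the 80% sample
fit <- lm(y ~ x, data = train)

# "Force" this model on the remaining 20%: use the training
# coefficients (unchanged) to predict the held-out cases
pred <- predict(fit, newdata = test)

# R^2 of the forced model on the 20% sample, computed by hand
ss_res  <- sum((test$y - pred)^2)
ss_tot  <- sum((test$y - mean(test$y))^2)
r2_test <- 1 - ss_res / ss_tot

summary(fit)$r.squared  # R^2 on the 80% sample
r2_test                 # R^2 of the same coefficients on the 20% sample
```

Is comparing summary(fit)$r.squared against an R^2 computed this way on the holdout what the passage means, or am I off base?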