
I've been self-studying Discovering Statistics Using R by Andy Field and have come across this passage:

Data splitting: This approach involves randomly splitting your data set, computing a regression equation on both halves of the data and then comparing the resulting models. When using stepwise methods, cross-validation is a good idea; you should run the stepwise regression on a random selection of about 80% of your cases. Then force this model on the remaining 20% of the data. By comparing values of the R^2 and b-values in the two samples you can tell how well the original model generalizes (see Tabachnick & Fidell, 2007, for more detail).

Alright, I understand subsetting my data (using sample()), and I know how to fit linear models (using lm()), but the line "Then force this model on the remaining 20% of the data" confuses me.

This technique is never brought up again in the book. Is there some function in R that lets you force a model onto data and then computes R^2 and b-values from that forced model? Perhaps some function where you input intercept and slope coefficients and it outputs something like summary(lm()) does?

Or am I not understanding what this passage is trying to say?

  • Have a look at the `predict()` function (`?predict`). – Bulat May 31 '16 at 22:17
  • I believe by "force" they just mean "use". Basically, since they advocated stepwise regression for the original 80% of the data (which is a bad suggestion, but I don't want to get into that), they are saying whatever model you decided on ultimately - that's the model you want to fit on the remaining data. So if the model ended up being x1+x2+x5 for the original data, use the model x1+x2+x5 on the other 20%. – Dason May 31 '16 at 22:23 (see the sketch after these comments)
  • Also, for cross-validation have a look at the `CV()` function in the `forecast` package; here is a bit of detail on CV metrics: http://robjhyndman.com/hyndsight/crossvalidation/ – Bulat May 31 '16 at 22:27
  • Can you transcribe the text rather than adding an image? – OrangeDog May 31 '16 at 23:00
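
To make Dason's point concrete, here is a minimal sketch of the whole procedure. The data frame mydata, its outcome y, and the predictors x1, x2 and x5 are all hypothetical stand-ins; the point is that "forcing" the model just means fitting the formula the stepwise procedure chose, with no further selection, on the held-out 20%:

# Hypothetical data frame 'mydata' with outcome 'y'
set.seed(1)
trainIdx <- sample(nrow(mydata), size = round(0.8 * nrow(mydata)))
train80 <- mydata[trainIdx, ]
test20  <- mydata[-trainIdx, ]

# Suppose stepwise selection on train80 settled on y ~ x1 + x2 + x5.
# "Forcing" that model means fitting the same formula on the holdout:
fit80 <- lm(y ~ x1 + x2 + x5, data = train80)
fit20 <- lm(y ~ x1 + x2 + x5, data = test20)

# Compare the b-values (coefficients) across the two halves
cbind(train = coef(fit80), test = coef(fit20))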

2 Answers


You use the predict() function with new data.

I don't have the book to hand, so I can't tell you the exact example, but if the remaining 20% of your data is a data frame called 'holdout', and your regression model is called 'reg1' then use:

# Predicted scores for the held-out cases, from the training-data model
holdout$pred <- predict(reg1, newdata = holdout)

Then you can calculate $R^2$ by looking at the correlation between the predicted scores and the original outcome scores. If the outcome is called 'out', then:

cor(holdout$pred, holdout$out)^2

Should do the trick.
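
As for the b-values (the term Field's book uses for the unstandardized regression coefficients): predict() won't give you those, but you can refit the formula that was chosen on the training data to the holdout cases and compare the coefficients directly. A minimal sketch, reusing the hypothetical 'reg1' and 'holdout' names from above:

# Refit the same formula (no further selection) on the 20% sample
reg2 <- lm(formula(reg1), data = holdout)

# b-values from the two samples, side by side
cbind(original = coef(reg1), holdout = coef(reg2))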

Jeremy Miles
  • Thank you for the predict() function, but what am I to make of the line "com[pare] values of R^2 and b-values"? As far as I can tell, predict() does not give me any statistical tests or coefficients. – user3547456 May 31 '16 at 23:29
  • @user3547456 After you use predict you have values that would support an R^2 calculation comparing predictions to actual. I'm guessing "b-values" is a misspelling of "p-values". I think this is actually a question for stats.stackexchange.com rather than SO, since you appear in serious need of statistical advice. – IRTFM May 31 '16 at 23:36
  • Is "b-values" referring to the coefficients from a logistic regression? – thelatemail Jun 01 '16 at 01:27

I second what Jeremy said. Here is an example with some made-up data that you can run to get a feel for it:

set.seed(26)

# Made-up data: 'a' is the outcome; 'c' and 'd' track it closely, 'b' is pure noise
mydf <- data.frame(a = 1:20,
                   b = rnorm(20),
                   c = 1:20 + runif(20),
                   d = 1:20 + runif(20) * sin(1:20))

# Random 80/20 split: 16 training rows, 4 held out
trainRows <- sample(1:20, 16)
mydf.train <- mydf[trainRows, ]
mydf.test <- mydf[-trainRows, ]

# Stepwise selection on the training half only
myModel <- lm(a ~ ., data = mydf.train)
model1 <- step(myModel)

summary(model1)

# "Force" the chosen model on the held-out 20%
mydf.test$pred <- predict(model1, newdata = mydf.test)

# Holdout R^2: squared correlation between predicted and observed
cor(mydf.test$pred, mydf.test$a)^2
#[1] 0.9999522
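
For the comparison the book describes, you can set that holdout R^2 next to the training-sample R^2, which is stored in the model summary (a small follow-up using the objects defined above):

# Training-sample R^2, for comparison with the holdout value
summary(model1)$r.squared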
Bryan Goggin