
This is starting to confuse me a bit. Take, for example, the following code that trains a GLM model:

glm_sens = train(
  form = target ~ .,
  data = ABT,
  trControl = trainControl(
    method = "repeatedcv",
    number = 5,
    repeats = 10,
    classProbs = TRUE,
    summaryFunction = twoClassSummary,
    savePredictions = TRUE
  ),
  method = "glm",
  family = "binomial",
  metric = "Sens"
)

I expected this to train several models and then select the one that performs best on sensitivity. Yet when I read up on cross-validation, most of what I find is about how it is used to calculate average performance scores.

Was my assumption wrong?


1 Answer

caret does train different models, but normally each one is fitted with a different set of hyper-parameters; you can check out an explanation of the process. Hyper-parameters cannot be learned directly from the data, so you need the tuning process to set them. These parameters decide how your model will behave; for example, the lasso has lambda, which decides how much regularization is applied to the model.
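
For instance, here is a minimal sketch of what that tuning loop looks like, assuming the glmnet package is installed; with method = "glmnet", caret tunes alpha and lambda, and the grid values below are arbitrary illustrations:

library(caret)

# Hypothetical two-class data: versicolor vs. virginica from iris.
two_class = droplevels(iris[iris$Species != "setosa", ])

# caret fits one model per row of tuneGrid on every resample,
# averages the resampled metric per candidate, and keeps the best lambda.
lasso_fit = train(
  Species ~ .,
  data = two_class,
  method = "glmnet",
  trControl = trainControl(method = "cv", number = 5),
  tuneGrid = expand.grid(alpha = 1, lambda = 10^seq(-4, 0, length.out = 10))
)
lasso_fit$bestTune  # the lambda selected by cross-validation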

In a glm, there is no hyper-parameter to tune. I guess what you are looking for is a way to select the best possible linear model out of the many potential variables. For that you can use step(), which performs stepwise selection by AIC:

fit = lm(mpg ~ ., data = mtcars)
step(fit, direction = "backward")  # drop predictors one at a time, keeping the lowest AIC

Another option is to use leaps with caret; an equivalent of the above would be:

train(
  mpg ~ .,
  data = mtcars,
  method = "leapBackward",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid = data.frame(nvmax = 2:6)
)

Linear Regression with Backwards Selection 

32 samples
10 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 30, 28, 28, 28, 30, 28, ... 
Resampling results across tuning parameters:

  nvmax  RMSE      Rsquared   MAE     
  2      3.299712  0.9169529  2.783068
  3      3.124146  0.8895539  2.750305
  4      3.249803  0.8849213  2.853777
  5      3.258143  0.8939493  2.823721
  6      3.123481  0.8917197  2.723475

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was nvmax = 6.
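
Once train() has picked nvmax, you can inspect which variables survived. A quick sketch, assuming the call above was assigned to a variable, say model (a hypothetical name); for method = "leapBackward" the finalModel slot is a leaps::regsubsets object, so coef() takes the subset size:

model$bestTune                                # the nvmax chosen by resampling
coef(model$finalModel, model$bestTune$nvmax)  # coefficients of the selected subset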

You can check out more about variable selection using leaps on this website.
