
I wish to confirm my understanding of the cross-validation (CV) procedure in the glmnet package so that I can explain it to a reviewer of my paper. I would be grateful if someone could add information to clarify it further.

Specifically, I had a binary classification problem with 29 input variables and 106 rows. Instead of splitting into training/test sets (and further shrinking the training data), I went with the lasso, choosing lambda through cross-validation, as a means to minimise overfitting. After training the model with cv.glmnet I tested its classification accuracy on the same dataset (bootstrapped 10,000 times for error intervals). I acknowledge that overfitting cannot be eliminated in this setting, but the lasso, with its penalty term chosen by cross-validation, should lessen its effect.
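
In code, what I did is roughly the following minimal sketch (assuming, purely for illustration, that x is my 106 x 29 numeric predictor matrix and y the binary outcome; both names are hypothetical):

    library(glmnet)

    ## Hypothetical names: x is the 106 x 29 numeric matrix of predictors,
    ## y the binary outcome (0/1 vector or factor) of length 106.
    set.seed(1)

    ## 10-fold CV over the lasso path, CV error = binomial deviance
    cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                       type.measure = "deviance", nfolds = 10)

    ## Apparent (in-sample) accuracy on the same data used for fitting
    pred <- predict(cvfit, newx = x, s = "lambda.1se", type = "class")
    apparent_acc <- mean(pred == y)

    ## Bootstrap (10,000 resamples) to get an interval around that accuracy
    boot_acc <- replicate(10000, {
      idx <- sample(nrow(x), replace = TRUE)
      mean(pred[idx] == y[idx])
    })
    quantile(boot_acc, c(0.025, 0.975))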

My explanation to the reviewer (who is a doctor like me) of how cv.glmnet does this is:

In each step of 10-fold cross-validation, the data were divided randomly into two groups: 9/10ths for training and 1/10th for internal validation (i.e., measuring the binomial deviance/error of the model developed with that lambda). Lambda vs. deviance was plotted. When the process was repeated 9 more times, 95% confidence intervals of lambda vs. deviance were derived. The final lambda value to go into the model was the one that gave the best compromise between high lambda and low deviance. A high lambda is what minimises overfitting, because the regression model is not allowed to improve its fit by assigning large coefficients to the variables. The model is then trained on the entire dataset by minimising the model error (deviance) penalised by the lambda term. Because the lambda term is chosen through cross-validation (and not from the entire dataset), the choice of lambda is somewhat independent of the data.
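
For reference, a sketch of where these quantities live on the fitted cv.glmnet object (called cvfit here, continuing the hypothetical names above):

    ## Assuming cvfit is the cv.glmnet object from the sketch above
    plot(cvfit)        # mean CV deviance vs log(lambda); error bars are +/- 1 SE

    cvfit$lambda.min   # lambda with the lowest mean CV deviance
    cvfit$lambda.1se   # largest lambda within one SE of that minimum

    ## Coefficients of the final model, refit on the full data at the chosen lambda
    coef(cvfit, s = "lambda.1se")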

I suspect my explanation can be improved much or the flaws in the methodology pointed out by the experts reading this. Thanks in advance.

Maelstorm

1 Answer


A bit late I guess, but here goes.

By default cv.glmnet uses lambda.1se (for example, when you call coef or predict on the fitted object). It is the largest λ at which the cross-validated error (binomial deviance in your case, MSE for Gaussian models) is within one standard error of the minimum. Along the lines of overfitting, this usually reduces overfitting by selecting a simpler model (fewer non-zero terms) whose error is still close to that of the model with the lowest error. You can also check out this post. I am not quite sure whether this is what you mean by "The final lambda value to go into the model was the one that gave the best compromise between high lambda and low deviance."
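
As a rough sketch (assuming cvfit is a cv.glmnet fit like yours), you can read the one-standard-error rule and the resulting sparsity straight off the object:

    ## Assuming cvfit is a cv.glmnet fit as in the question
    i_min <- which.min(cvfit$cvm)
    cvfit$cvm[i_min] + cvfit$cvsd[i_min]   # the one-standard-error threshold
    cvfit$lambda.1se                       # largest lambda with CV error below it

    ## lambda.1se typically gives a sparser model than lambda.min
    sum(coef(cvfit, s = "lambda.min") != 0)
    sum(coef(cvfit, s = "lambda.1se") != 0)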

The main issue with your approach is calculating accuracy on the same data the model was trained on. This does not tell you how well the model will perform on unseen data, and bootstrapping that same data does not fix the problem; it only gives you an interval around an optimistic estimate. For an estimate of the error you should instead use the error from the cross-validation itself. And if a model fitted on 90% of the data does not perform well on the held-out 10%, I don't see how refitting on all of the data would do better.
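
For example (again a sketch, reusing the hypothetical x and y from your question), re-running cv.glmnet with type.measure = "class" gives you a cross-validated misclassification error that you could report instead of the in-sample accuracy:

    ## Same hypothetical x and y as in the question; CV error measured as
    ## misclassification rate instead of deviance
    set.seed(1)
    cvfit_class <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                             type.measure = "class", nfolds = 10)

    i <- which(cvfit_class$lambda == cvfit_class$lambda.1se)
    cvfit_class$cvm[i]       # cross-validated misclassification error
    cvfit_class$cvsd[i]      # its standard error across folds
    1 - cvfit_class$cvm[i]   # cross-validated accuracy to report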

StupidWolf