I wish to confirm my understanding of the cross-validation (CV) procedure in the glmnet package so that I can explain it to a reviewer of my paper. I would be grateful if someone could add information to clarify or correct it further.
Specifically, I had a binary classification problem with 29 input variables and 106 rows. Instead of splitting into training/test sets (and further shrinking the training data), I used the lasso, choosing lambda through cross-validation, as a means of minimising overfitting. After training the model with cv.glmnet, I tested its classification accuracy on the same dataset (bootstrapped 10,000 times for error intervals). I acknowledge that overfitting cannot be eliminated in this setting, but the lasso, with its penalty term chosen by cross-validation, should lessen its effect.
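For concreteness, here is a minimal R sketch of the workflow I have in mind. `x` and `y` are placeholder names for my 106 x 29 predictor matrix and binary outcome (not the actual objects from my analysis), and I show `lambda.1se` purely for illustration; `lambda.min` is the other standard choice:

```r
library(glmnet)

## x: 106 x 29 numeric matrix of predictors, y: 0/1 outcome vector
## (both are placeholder names, not the actual objects from my analysis)
set.seed(1)
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                   nfolds = 10, type.measure = "deviance")

## classify the same 106 rows at the chosen lambda
## (lambda.1se shown; lambda.min is the other standard choice)
pred <- as.numeric(drop(predict(cvfit, newx = x, s = "lambda.1se", type = "class")))

## bootstrap the apparent accuracy 10,000 times for a percentile interval
accs <- replicate(10000, {
  idx <- sample(seq_along(y), replace = TRUE)
  mean(pred[idx] == y[idx])
})
mean(pred == y)                  # point estimate of apparent accuracy
quantile(accs, c(0.025, 0.975))  # 95% bootstrap percentile interval
```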
My explanation to the reviewer (who is a doctor like me) of how cv.glmnet does this is:
For 10-fold cross-validation, the data were divided randomly into 10 folds; in each step, 9/10ths of the data were used for training and the remaining 1/10th for internal validation (i.e., measuring the binomial deviance/error of the model developed at each candidate lambda). After all 10 folds had each served once as the validation set, the mean cross-validated deviance and its standard error were obtained for every lambda and plotted as deviance vs. lambda. The final lambda to go into the model was the one giving the best compromise between a high lambda and a low deviance (in glmnet, the built-in choices are lambda.min, the lambda with the lowest mean deviance, and lambda.1se, the largest lambda whose deviance is within one standard error of that minimum). A high lambda is what limits overfitting, because the model is not allowed to improve its fit by assigning large coefficients to the variables. The model is then refitted on the entire dataset by maximising the binomial log-likelihood penalised by the lambda term (penalised maximum likelihood rather than least squares, since the outcome is binary). Because lambda is chosen by comparing deviance on held-out folds rather than on the data each model was fitted to, the choice of lambda is not tuned directly to the full training fit, which further limits overfitting.
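To make this concrete for the reviewer, a small sketch of where these quantities live in the cv.glmnet output (using the hypothetical `cvfit` object from the sketch above):

```r
## how the CV curve and the chosen lambda appear in the fitted object
## (cvfit is the hypothetical object from the sketch above)
plot(cvfit)            # mean CV binomial deviance vs log(lambda), with +/- 1 SE bars

cvfit$lambda.min       # lambda giving the lowest mean cross-validated deviance
cvfit$lambda.1se       # largest lambda whose deviance is within 1 SE of that minimum

## final model coefficients, refitted on all rows at the chosen lambda;
## variables with coefficients shrunk exactly to zero are dropped by the L1 penalty
coef(cvfit, s = "lambda.1se")
```

The one-standard-error rule (lambda.1se) is the built-in way of favouring a larger lambda at the cost of slightly higher deviance, which is the compromise I describe above.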
I suspect my explanation can be much improved, or that there are flaws in the methodology that the experts reading this can point out. Thanks in advance.