I am confused about the following things:
- Data splitting into training, validation, and testing
- How and at which step should hyperparameter tuning be performed, and which data should be used for it?
- Can stratified k-fold cross-validation be performed on the `best_estimator_` obtained via `RandomizedSearchCV`?
- Finally, which model should be used for deployment? Do I need to retrain the `best_estimator_` on the entire training dataset to get the final model?
I have tried an approach to hyperparameter tuning and cross-validation of a scikit-learn model, and I would like confirmation from machine-learning experts on whether the approach is correct.
Let me explain briefly what I have done:
I have a training dataset and a separate testing dataset. Using only the training dataset, I performed hyperparameter tuning via `RandomizedSearchCV` with `cv=5`. Once I obtained the `best_estimator_`, I ran a stratified 5-fold cross-validation in which the training dataset was split into train and validation folds: in each iteration, the `best_estimator_` was retrained on the train folds and evaluated on the validation fold, in order to check how well it generalizes. Finally, the performance of the `best_estimator_` was evaluated on the separate, unseen testing dataset.
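Here is a minimal, self-contained sketch of that workflow. The synthetic dataset (`make_classification`), the `RandomForestClassifier`, and the parameter grid are just stand-ins for my actual data, model, and search space:

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    RandomizedSearchCV,
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)

# Stand-in for my separate training and testing datasets.
X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 1: hyperparameter tuning on the training set only,
# with an internal 5-fold CV inside the search.
param_distributions = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 5, 10],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,
    cv=5,
    random_state=0,
)
search.fit(X_train, y_train)
best_model = search.best_estimator_

# Step 2: stratified 5-fold CV of the tuned configuration, again on
# the training set, to check generalization. clone() gives a fresh,
# unfitted copy with the same hyperparameters for each fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clone(best_model), X_train, y_train, cv=cv)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Step 3: one final evaluation on the unseen testing dataset.
print("Test accuracy: %.3f" % best_model.score(X_test, y_test))
```

(As far as I understand, with the default `refit=True`, `RandomizedSearchCV` already refits `best_estimator_` on the full training set, which is partly why I am unsure whether a separate retraining step is needed before deployment.)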
Any corrections or suggestions would be highly appreciated. Thanks in advance!