
I am a little confused when it comes to grid search and fitting the final model. I split the data in two: a training set and a test set. The test set is only used for the final evaluation. I perform the grid search only on the training data.
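
A minimal sketch of my setup (the data and the RandomForestClassifier are just placeholders for my actual problem):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Placeholder data; in reality X and y come from my own dataset.
X, y = make_classification(n_samples=500, random_state=42)

# The test set is held out and only used for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {"n_estimators": [100, 300], "max_depth": [3, None]}

# Grid search with cross-validation on the training data only.
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
```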

Say one has done a grid search over several hyperparameters using cross-validation. The grid search gives the best combination of hyperparameters. The next step is to train the model, and this is where I am confused. I see two possibilities:

1) Don't train the model again. Use the fitted parameters of the best model from the grid search.

or

2) Don't use the fitted parameters from the grid search. Instead, train the model on the full training set with the best hyperparameter combination found by the grid search.

What is the correct approach, 1 or 2?
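
In code, I imagine the two options looking roughly like this (continuing the sketch above; `best_estimator_` and `best_params_` are the GridSearchCV attributes):

```python
# Option 1: reuse the fitted model kept by the grid search,
# with no further training.
model_option_1 = grid.best_estimator_

# Option 2: take only the winning hyperparameters and fit a
# fresh model on the full training set.
model_option_2 = RandomForestClassifier(random_state=42, **grid.best_params_)
model_option_2.fit(X_train, y_train)
```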

KJA
  • There is no option here. What GridSearchCV has found are hyperparameters (which are used to initialize and control the model and the learning), but the model still needs to learn from the data. The parameters learnt by the model are very different from the hyperparameters found by GridSearchCV. Only option 2 is viable, and GridSearchCV will even do that for you; you only need to call `predict()` with your new (test) data. – Vivek Kumar Oct 29 '18 at 06:43
  • Thanks for replying, Vivek Kumar. I think I might have been a little unclear. I was not asking about making predictions on the test set, which you mention in your last sentence. I was asking about how to fit the model to get the parameters that can later be used for e.g. predictions. Should I 1) use the fitted parameters from the best grid-search model, or 2) fit the model on the full training set using the best hyperparameter combination from the grid search? – KJA Oct 30 '18 at 07:17

2 Answers


This is probably late, but might be useful for someone else who comes along.

GridSearchCV has a parameter called refit, which is set to True by default. This means that after performing k-fold cross-validation (where each candidate model is trained on only a subset of the data you passed in), it refits the model with the best hyperparameters from the grid search on the complete training set.

Presumably your question, from what I can glean, can be summarized as:

Suppose you use 5-fold cross-validation. Your model is then fitted only on 4 folds, as the fifth fold is used for validation. So would you need to retrain the model on the whole training set (i.e., the data from all 5 folds)?

The answer is no, provided you set refit to True (the default), in which case GridSearchCV will retrain on the whole training set using the best hyperparameters it found during cross-validation. It will then return the trained estimator object, on which you can call the predict method directly, as you normally would.
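
For example (a minimal sketch; the estimator and parameter grid are just illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# refit=True is the default: after cross-validation, GridSearchCV refits
# the best hyperparameter combination on all of X_train.
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.1, 1.0, 10.0]},
                    cv=5, refit=True)
grid.fit(X_train, y_train)

# grid.best_estimator_ is already trained on the full training set,
# so predict/score can be called on the search object directly.
predictions = grid.predict(X_test)
```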

Refer: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

Nikhil Kumar

You train the model on the full training set, using the best hyperparameters obtained from the grid search.

And then you can evaluate the model on the test set.
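
If you want to do this refit explicitly yourself (a minimal sketch, assuming the grid search and train/test split from the question; note that with refit=True GridSearchCV already does this for you):

```python
from sklearn.base import clone

# Fresh, unfitted copy of the estimator, configured with the
# best hyperparameters found by the grid search.
final_model = clone(grid.estimator).set_params(**grid.best_params_)

# Fit on the full training set, then evaluate once on the held-out test set.
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
```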

Franco Piccolo
  • I have split the data into a training set and a test set. – KJA Oct 28 '18 at 14:34
  • When doing the grid search I use cross-validation on the training set. With 5-fold cross-validation, estimations are run on 5 different subsets of the training data. For a given hyperparameter combination, one of these models will be the best, and that best model is kept. This procedure is repeated for all hyperparameter combinations. The best hyperparameters are those of the best model among the winners from the different combinations. But that model was fit on only 4/5 of the training data. Another option is to use the same hyperparameters as the winning model, but fit on the full training set. – KJA Oct 28 '18 at 14:44
  • Yes, for the final training you use the full training set. – Franco Piccolo Oct 28 '18 at 15:33
  • @FrancoPiccolo, why do we need to train the final model (e.g. the one found by sklearn's `best_estimator_`) once again, when the final model was already found using this training data set? – Md. Sabbir Ahmed May 31 '20 at 04:00