
GridSearchCV uses StratifiedKFold or KFold. My question is: should I split my data into train and test sets before using grid search, and then run the fit only on the training data? I am not sure whether this is necessary, because the CV method already splits the data, but I have seen some examples that split the data beforehand.

Thank you.

Kübra Kutlu

1 Answer


GridSearchCV takes the data you give it, splits it into training and validation sets, and searches for the best hyperparameters using the validation scores. You can specify a different split strategy if you want (for example, the number of folds or the proportion of the split).

But when you perform hyperparameter tuning, information about the dataset still 'leaks' into the algorithm.

Hence I would advise the following approach (a minimal sketch in code follows these steps):

1) Take your original dataset and hold out some data as a test set (say, 10%).

2) Run the grid search on the remaining 90%. The internal train/validation splitting is done for you by GridSearchCV.

3) Once you have the optimal hyperparameters, evaluate on the test set from step 1 to get a final estimate of the performance you can expect on new data.
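
For concreteness, here is a minimal sketch of these three steps in scikit-learn. The SVC estimator, the iris data, and the parameter grid are only placeholders; substitute your own model and grid.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 1) Hold out 10% of the data as a final test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42
)

# 2) Grid search on the remaining 90%; GridSearchCV does the
#    internal train/validation splitting itself (5-fold CV here).
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

# 3) Final performance estimate on data the search never saw.
print("Best params:", search.best_params_)
print("Test accuracy:", search.score(X_test, y_test))
```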

Maksim Khaitovich
  • Wouldn't it be better to do GridSearchCV over the whole training dataset, since it already does CV, and, once the search is done, use the found classifier to fit and predict on a training/test data split? – fjsj Nov 29 '18 at 00:16
  • @fjsj It is a valid point, but during the grid search some information about the dataset over which you perform the search still leaks into the hyperparameters. To get a final, unbiased measure of performance on new data, you need to hold out a sample of the dataset that the fitted classifier has never seen, directly or indirectly. This is especially relevant if you are predicting time series data: better to do the search on, say, Jan–Nov data and then do the final test on Dec data to get a realistic estimate of performance. – Maksim Khaitovich Nov 29 '18 at 07:33
  • @MaksimKhaitovich - in some places I've seen a variation on your answer: people taking the `best_params` from the grid search and using those to call `fit()` on the entire dataset (including the test part) before calling `predict()`; basically, calling `fit()` twice - once as part of the grid search, and again outside it. Does it make sense to you? – Jack Fleeting Feb 13 '19 at 18:59
  • @JackFleeting I'm not sure what you mean. When you call fit as part of the grid search, you find the best hyperparameters using a subset of the data (or, with k-fold, many subsets, to get an estimate of the best hyperparameters). The second time you call fit, you cover the full set of training data. – Maksim Khaitovich Feb 18 '19 at 06:17
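
On the 'fit twice' question in the comments: with scikit-learn's default `refit=True`, GridSearchCV already performs that second fit for you, refitting the best estimator on the full training set once the search finishes. A small sketch, reusing the names from the example above; the manual version just spells out the equivalent steps.

```python
from sklearn.base import clone

# refit=True (the default) means best_estimator_ was already
# refit on all of X_train after the search - this is the
# "second fit" from the comments.
best_model = search.best_estimator_
preds = best_model.predict(X_test)

# Manual equivalent: apply the best hyperparameters to a fresh
# copy of the estimator and fit once more on the training data.
# Note: fitting on X_test as well would invalidate the final
# performance estimate.
manual = clone(search.estimator).set_params(**search.best_params_)
manual.fit(X_train, y_train)
```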