How to use GridSearchCV for tuning parameters with train_test_split strategy?

Question

I am trying to fine-tune my sklearn models using a train_test_split strategy. I know that GridSearchCV can perform parameter tuning, but it is tied to a cross-validation strategy. Because training speed is important in my case, I would prefer a simple train_test_split over cross-validation for the parameter search.

I could write my own for loop, but that would be inefficient because it would not take advantage of the built-in parallelization in GridSearchCV.

Does anyone know how to use GridSearchCV for this, or can you suggest an alternative that isn't too slow?

score 6 · Accepted Answer · answered Aug 27 '19 at 13:25

Yes, you can use ShuffleSplit for this.

ShuffleSplit is a cross-validation strategy like KFold, but whereas KFold requires training K models, ShuffleSplit lets you control how many times the train/test split is performed, even just once if you prefer.

from sklearn.model_selection import ShuffleSplit
shuffle_split = ShuffleSplit(n_splits=1, test_size=0.25)

n_splits defines how many times to repeat this splitting and training routine. Now you can use it like this:

GridSearchCV(clf, param_grid={}, cv=shuffle_split)
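As a minimal sketch of the idea above, the following checks that a ShuffleSplit with n_splits=1 really produces a single train/test split and that GridSearchCV then scores each candidate once. The iris data, the DecisionTreeClassifier, the max_depth grid, and the fixed random_state are assumptions chosen only for illustration; n_jobs=-1 is one way to keep the built-in parallelization the question asks about.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.tree import DecisionTreeClassifier

# Illustrative data only; any (X, y) pair works the same way.
X, y = load_iris(return_X_y=True)

# One shuffled 75/25 split; random_state is fixed here for reproducibility.
shuffle_split = ShuffleSplit(n_splits=1, test_size=0.25, random_state=0)

# The splitter yields exactly one (train indices, test indices) pair.
splits = list(shuffle_split.split(X))
n_splits_generated = len(splits)
train_idx, test_idx = splits[0]

# Hypothetical grid; each candidate is fitted once on the single split.
param_grid = {"max_depth": [2, 4, 8]}
grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid=param_grid,
    cv=shuffle_split,
    n_jobs=-1,  # evaluate candidates in parallel where possible
)
grid_search.fit(X, y)

n_candidates = len(grid_search.cv_results_["params"])
print(grid_search.best_params_, grid_search.best_score_)
```

Note that with the default refit=True, GridSearchCV retrains the best candidate on all of X and y after the search, so best_estimator_ has seen the held-out portion as well.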

score 1 · Answer 2 · answered Aug 28 '19 at 03:19

I would like to add to Shihab Shahriar's answer by providing a code sample.

import pandas as pd
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.ensemble import RandomForestClassifier

# Load iris dataset
iris = datasets.load_iris()

# Prepare X as a DataFrame and y as a Series
# (a 1-D target avoids the DataConversionWarning raised for a column-vector y)
X = pd.DataFrame(data=iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='Species')

# Train/test split: a single shuffled 70/30 split
shuffle_split = ShuffleSplit(n_splits=1, test_size=0.3)
# This plays the same role as:
#   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# but in a form that GridSearchCV accepts through its cv argument

# Grid search scored on the single split instead of k-fold CV
params = { 'n_estimators': [16, 32] }
clf = RandomForestClassifier()
grid_search = GridSearchCV(clf, param_grid=params, cv=shuffle_split)
grid_search.fit(X, y)

This should help anyone facing a similar problem.
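If the search has to use one specific split, for example one already produced by train_test_split, a possible alternative is sklearn's PredefinedSplit. The sketch below is an illustration rather than part of either answer; the index bookkeeping, the 0.3 test size, and the fixed random_state are assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split

X, y = load_iris(return_X_y=True)

# Split the row indices rather than the data so the split can be reused.
indices = np.arange(len(X))
train_idx, test_idx = train_test_split(indices, test_size=0.3, random_state=0)

# -1 marks rows that are always used for training; 0 marks the single test fold.
test_fold = np.full(len(X), -1)
test_fold[test_idx] = 0
predefined_split = PredefinedSplit(test_fold)

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [16, 32]},
    cv=predefined_split,
)
grid_search.fit(X, y)
best_params = grid_search.best_params_
print(best_params)
```

Every candidate is then scored on exactly the rows in test_idx, which makes the result comparable with any evaluation done on the same hold-out set elsewhere.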
