
I was wondering if you can run RandomizedSearchCV without cross-validation (just using a simple train/test split)?

I want to do this to be able to ballpark which parameters will be useful for more fine-grained tuning, where I would then use standard cross-validation.

Here is the code:

pipe = Pipeline(steps=[('gbm', GradientBoostingClassifier())])


param_dist = dict(gbm__max_depth=[3,6,10],
                  gbm__n_estimators=[50,100,500,1000],
                  gbm__min_samples_split=[2,5,8,11],
                  gbm__learning_rate=[0.01,0.05,0.1,0.5,1.0],
                  gbm__max_features=['sqrt', 'log2']
                  )

grid_search = RandomizedSearchCV(pipe, param_distributions=param_dist, cv=???)

grid_search.fit(X_train, y_train)

Thanks in advance,

anthonybell
  • Probably there is not an easy solution for that. Why would you rather avoid cross-validation? I mean CV is the standard way for parameter fitting. It is often the best choice since it tends to be more robust and also avoids subtle overfitting issues to the training/testing set. – cel Mar 22 '15 at 07:26

2 Answers


You can use cv=ShuffleSplit(n_iter=1) to get a single random split (in scikit-learn 0.18 and later the argument is n_splits, and the class lives in sklearn.model_selection), or use cv=PredefinedSplit(...) if there is a particular split you'd like to do (only in the beta 0.16b1, I think). See the docs for options.
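Both options from this answer can be sketched with the modern sklearn.model_selection API. This is a minimal illustration on synthetic data; make_classification, LogisticRegression, and the small C grid are stand-ins for your own data and search space, not part of the original answer:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV, ShuffleSplit

X, y = make_classification(n_samples=200, random_state=0)
param_distributions = {'C': [0.01, 0.1, 1.0, 10.0]}

# Option 1: a single random 75/25 split via ShuffleSplit(n_splits=1).
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_distributions,
    n_iter=4,
    cv=ShuffleSplit(n_splits=1, test_size=0.25, random_state=0),
    random_state=0,
)
search.fit(X, y)

# Option 2: a particular split via PredefinedSplit.
# -1 marks rows that are always in training; 0 marks rows in the test fold.
test_fold = np.where(np.arange(len(y)) < 150, -1, 0)
search_pre = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_distributions,
    n_iter=4,
    cv=PredefinedSplit(test_fold),
    random_state=0,
)
search_pre.fit(X, y)

print(search.best_params_, search_pre.best_params_)
```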

Andreas Mueller

Yes, you can run RandomizedSearchCV with a simple train/test split instead of cross-validation for parameter tuning.

To do this, use the ShuffleSplit class from the sklearn.model_selection module to create a single train/test split for the parameter search. Add just one of the following lines:

from sklearn.model_selection import ShuffleSplit

my_cv = ShuffleSplit(n_splits=1)
my_cv = ShuffleSplit(n_splits=1, test_size=0.33, random_state=0)

The first option generates a random train/test split with the default sizes, while the second lets you set the test-set fraction explicitly and fix the random seed for reproducibility.

Then pass this split to RandomizedSearchCV by setting cv=my_cv.

One important point: in this setup, RandomizedSearchCV manages the train/test split for you, so you should fit it on your complete dataset. Instead of passing (X_train, y_train), pass (features, target) so that RandomizedSearchCV handles the data partitioning internally.

Here is how you can modify your code:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, ShuffleSplit
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[('gbm', GradientBoostingClassifier())])

my_cv = ShuffleSplit(n_splits=1, test_size=0.33, random_state=0) # <==========

param_dist = dict(gbm__max_depth=[3, 6, 10],
                  gbm__n_estimators=[50, 100, 500, 1000],
                  gbm__min_samples_split=[2, 5, 8, 11],
                  gbm__learning_rate=[0.01, 0.05, 0.1, 0.5, 1.0],
                  gbm__max_features=['sqrt', 'log2'])

grid_search = RandomizedSearchCV(pipe, param_distributions=param_dist, cv=my_cv)

grid_search.fit(features, target) # <==========

In this code, cv is the single train/test split created by ShuffleSplit, and you can customize test_size and the other parameters to your preference.

Hope this helps!

Marawan Mamdouh