
I was wondering if you can run RandomizedSearchCV without cross-validation (just using a simple train/test split)?

I want to do this to be able to ballpark which parameters will be useful for more fine-grained tuning, where I would then use standard cross-validation.

Here is the code:

pipe = Pipeline(steps=[('gbm', GradientBoostingClassifier())])


param_dist = dict(gbm__max_depth=[3,6,10],
                  gbm__n_estimators=[50,100,500,1000],
                  gbm__min_samples_split=[2,5,8,11],
                  gbm__learning_rate=[0.01,0.05,0.1,0.5,1.0],
                  gbm__max_features=['sqrt', 'log2']
                  )

grid_search = RandomizedSearchCV(pipe, param_distributions=param_dist, cv=???)

grid_search.fit(X_train, y_train)

Thanks in advance,

anthonybell
  • Probably there is not an easy solution for that. Why would you rather avoid cross-validation? I mean CV is the standard way for parameter fitting. It is often the best choice since it tends to be more robust and also avoids subtle overfitting issues to the training/testing set. – cel Mar 22 '15 at 07:26

2 Answers


You can use cv=ShuffleSplit(n_iter=1) to get a single random split (in scikit-learn 0.18 and later the argument is n_splits, and the class lives in sklearn.model_selection), or use cv=PredefinedSplit(...) if there is a particular split you'd like to do (only in the beta 0.16b1, I think). See the docs for options.
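Both options from this answer can be sketched with the modern sklearn.model_selection API. This is a minimal illustration on synthetic data; make_classification, LogisticRegression, and the small C grid are stand-ins for your own data and search space, not part of the original answer:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV, ShuffleSplit

X, y = make_classification(n_samples=200, random_state=0)
param_distributions = {'C': [0.01, 0.1, 1.0, 10.0]}

# Option 1: a single random 75/25 split via ShuffleSplit(n_splits=1).
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_distributions,
    n_iter=4,
    cv=ShuffleSplit(n_splits=1, test_size=0.25, random_state=0),
    random_state=0,
)
search.fit(X, y)

# Option 2: a particular split via PredefinedSplit.
# -1 marks rows that are always in training; 0 marks rows in the test fold.
test_fold = np.where(np.arange(len(y)) < 150, -1, 0)
search_pre = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_distributions,
    n_iter=4,
    cv=PredefinedSplit(test_fold),
    random_state=0,
)
search_pre.fit(X, y)

print(search.best_params_, search_pre.best_params_)
```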

Andreas Mueller

Yes, you can run RandomizedSearchCV with a simple train/test split instead of cross-validation for parameter tuning.

To do this, use the ShuffleSplit class from the sklearn.model_selection module to create a single train/test split for the parameter search. Add just one of the following lines:

from sklearn.model_selection import ShuffleSplit

my_cv = ShuffleSplit(n_splits=1)
my_cv = ShuffleSplit(n_splits=1, test_size=0.33, random_state=0)

The first option generates a random train/test split with the default sizes, while the second lets you set the test-set fraction explicitly and fix the random seed for reproducibility.

Then pass this split to RandomizedSearchCV by setting cv=my_cv.

One important point: in this setup, RandomizedSearchCV manages the train/test split for you, so you should fit it on your complete dataset. Instead of passing (X_train, y_train), pass (features, target) so that RandomizedSearchCV handles the data partitioning internally.

Here is how you can modify your code:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, ShuffleSplit
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[('gbm', GradientBoostingClassifier())])

my_cv = ShuffleSplit(n_splits=1, test_size=0.33, random_state=0) # <==========

param_dist = dict(gbm__max_depth=[3, 6, 10],
                  gbm__n_estimators=[50, 100, 500, 1000],
                  gbm__min_samples_split=[2, 5, 8, 11],
                  gbm__learning_rate=[0.01, 0.05, 0.1, 0.5, 1.0],
                  gbm__max_features=['sqrt', 'log2'])

grid_search = RandomizedSearchCV(pipe, param_distributions=param_dist, cv=my_cv)

grid_search.fit(features, target) # <==========

In this code, cv is the single train/test split created by ShuffleSplit, and you can customize test_size and the other parameters to your preference.

Hope this helps!

Marawan Mamdouh