
I am currently developing a regression model with XGBoost. Since XGBoost has multiple hyperparameters, I added cross-validation logic with GridSearchCV(). As a trial, I set max_depth to [2, 3]. My Python code is below.

import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import mean_squared_error

xgb_reg = xgb.XGBRegressor()

# Obtain the best hyperparameters
# (greater_is_better=False because a lower MSE is better)
scorer = make_scorer(mean_squared_error, greater_is_better=False)
params = {'max_depth': [2,3], 
          'eta': [0.1], 
          'colsample_bytree': [1.0],
          'colsample_bylevel': [0.3],
          'subsample': [0.9],
          'gamma': [0],
          'lambda': [1],
          'alpha':[0],
          'min_child_weight':[1]
         }
grid_xgb_reg = GridSearchCV(xgb_reg,
                            param_grid=params,
                            scoring=scorer,
                            cv=5,
                            n_jobs=-1)

grid_xgb_reg.fit(X_train, y_train)
y_pred = grid_xgb_reg.predict(X_test)
y_train_pred = grid_xgb_reg.predict(X_train)

## Evaluate model
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

print('RMSE  train: %.3f,  test: %.3f' % (np.sqrt(mean_squared_error(y_train, y_train_pred)), np.sqrt(mean_squared_error(y_test, y_pred))))
print('R^2   train: %.3f,  test: %.3f' % (r2_score(y_train, y_train_pred), r2_score(y_test, y_pred)))

The problem is that GridSearchCV does not seem to choose the best hyperparameters. When I set max_depth to [2, 3], the result is as follows; in this case, GridSearchCV chose max_depth: 2 as the best hyperparameter.

#  The result when max_depth is 2
RMSE  train: 11.861,  test: 15.113
R^2   train: 0.817,  test: 0.601

However, if I change max_depth to [3] (i.e. remove 2), the test score is better than the previous one:

#  The result when max_depth is 3
RMSE  train: 9.951,  test: 14.752
R^2   train: 0.871,  test: 0.620

Question

My understanding is that even if I set max_depth to [2, 3], GridSearchCV SHOULD choose max_depth: 3 as the best hyperparameter, since max_depth: 3 gives better RMSE and R^2 scores than max_depth: 2. Could anyone tell me why my code does not choose the best hyperparameters when I set max_depth to [2, 3]?

shumach5
  • The grid search chooses the best hyperparameters based on its internal cross-validation scores (have a look at its attribute `cv_results_`); the winner there isn't guaranteed to perform best on a new test set. – Ben Reiniger Oct 04 '21 at 12:34
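
As the comment above suggests, you can look at the scores GridSearchCV actually optimizes. Below is a minimal sketch, assuming the fitted grid_xgb_reg from the question (pandas is only used here for readable printing); note that with greater_is_better=False the stored scores are negated MSE values, so the value closest to zero is best.

import pandas as pd

# The per-candidate cross-validation scores that GridSearchCV compares
cv_results = pd.DataFrame(grid_xgb_reg.cv_results_)
print(cv_results[['param_max_depth', 'mean_test_score', 'std_test_score', 'rank_test_score']])

# The winner according to the internal CV (not the held-out test set)
print('Best params:', grid_xgb_reg.best_params_)
print('Best CV score:', grid_xgb_reg.best_score_)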

1 Answer


If you run a second experiment with max_depth: 2 only, the results are not comparable to those of the first experiment with max_depth: [2, 3], not even for the run with max_depth: 2, because there are sources of randomness in your code which you do not explicitly control; in other words, your code is not reproducible.

The first source of randomness is the CV folds; in order to ensure that the experiments will be run on identical splits of the data, you should define your GridSearchCV as follows:

from sklearn.model_selection import KFold

seed_cv = 123  # any random value here

# shuffle=True is needed for random_state to have an effect
kf = KFold(n_splits=5, shuffle=True, random_state=seed_cv)

grid_xgb_reg = GridSearchCV(xgb_reg,
                            param_grid=params,
                            scoring=scorer,
                            cv=kf,   # <- change here
                            n_jobs=-1)

The second source of randomness is the XGBRegressor itself, which also includes a random_state argument (see the docs); you should change it to:

seed_xgb = 456  # any random value here (it can even be the same as seed_cv)
xgb_reg = xgb.XGBRegressor(random_state=seed_xgb)

But even with these arrangements, and although your data splits will now be identical, the regression models built will not necessarily be identical in the general case. Here, if you keep the experiments as described, i.e. first with max_depth: [2, 3] and then with max_depth: 2, the results will indeed be identical; but if you change them to, say, first max_depth: [2, 3] and then max_depth: 3, they will not, because in the first experiment the run with max_depth: 3 starts from a different state of the random number generator (namely, the state left after the run with max_depth: 2 has finished).

There are limits to how identical you can make different runs under such conditions; for an example of a very subtle difference that nevertheless destroys the exact reproducibility between two experiments, see my answer in Why does the importance parameter influence performance of Random Forest in R?
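
If you want each candidate to be evaluated in complete isolation, so that its cross-validation score does not depend on which other candidates happen to be in the grid, one option is to loop over the grid yourself and re-seed a fresh estimator for every candidate. A minimal sketch, assuming the same X_train, y_train, scorer and seeds as above:

import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=5, shuffle=True, random_state=123)

for depth in [2, 3]:
    # A fresh, identically seeded estimator per candidate, so no candidate
    # inherits the random number generator state left by a previous one
    model = xgb.XGBRegressor(max_depth=depth, random_state=456)
    scores = cross_val_score(model, X_train, y_train, scoring=scorer, cv=kf, n_jobs=-1)
    # the scorer negates the MSE, so flip the sign before taking the square root
    print('max_depth=%d  CV RMSE: %.3f' % (depth, np.sqrt(-scores.mean())))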

desertnaut
  • Thanks! You are saying that if I run `GridSearchCV` with `max_depth:[2,3]`, the result of `max_depth:3` is affected by the former result of `max_depth:2`, so the best hyperparameter would become `max_depth:2`, am I correct? If yes, I would like to make each result identical. In that case, should I run the `for` loop and do my manual `GridSearch` like you explain in the other thread? – shumach5 Oct 04 '21 at 23:23
  • @shumach5 I did not say anything about best hyperparams; I only said that the results will not be identical with running a single run with `max_depth:3` , and I explained why. Unfortunately, not all experiments can be fully comparable in that sense, as I also mention - randomness enters from too many angles – desertnaut Oct 05 '21 at 00:33
  • Thank you for your answer! So the key factor that I missed was that the result is not identical. – shumach5 Oct 07 '21 at 23:22