
I am fairly new to scikit-learn and have been trying to hyperparameter-tune XGBoost. My aim is to use grid search to tune the model parameters and early stopping to control the number of trees and avoid overfitting.

As I am using cross-validation for the grid search, I was hoping to also use cross-validation in the early stopping criterion. The code I have so far looks like this:

import numpy as np
import pandas as pd
from sklearn import model_selection
import xgboost as xgb

#Import training and test data
train = pd.read_csv("train.csv").fillna(value=-999.0)
test = pd.read_csv("test.csv").fillna(value=-999.0)

# Encode variables
y_train = train.price_doc
x_train = train.drop(["id", "timestamp", "price_doc"], axis=1)

# XGBoost - sklearn method
gbm = xgb.XGBRegressor()

xgb_params = {
    'learning_rate': [0.01, 0.1],
    'n_estimators': [2000],
    'max_depth': [3, 5, 7, 9],
    'gamma': [0, 1],
    'subsample': [0.7, 1],
    'colsample_bytree': [0.7, 1]
}

fit_params = {
    'early_stopping_rounds': 30,
    'eval_metric': 'mae',
    'eval_set': [[x_train, y_train]]
}

grid = model_selection.GridSearchCV(gbm, xgb_params, cv=5,
                                    fit_params=fit_params)
grid.fit(x_train, y_train)

The problem I am having is with the 'eval_set' parameter. I understand that this wants the predictor and response variables, but I am not sure how I can use the cross-validation data as the early stopping criterion.

Does anyone know how to overcome this problem? Thanks.

George

3 Answers


You could pass your early_stopping_rounds and eval_set as extra fit_params to GridSearchCV, and that would actually work. However, GridSearchCV will not change the fit_params between the different folds, so you would end up using the same eval_set in every fold, which might not be what you mean by CV.

model = xgb.XGBClassifier()
clf = GridSearchCV(model, parameters,
                   fit_params={'early_stopping_rounds': 20,
                               'eval_set': [(X, y)]},
                   cv=kfold)

After some tweaking, I found that the safest way to integrate early_stopping_rounds with the sklearn API is to implement an early-stopping mechanism yourself. You can do this by running GridSearchCV with n_rounds as a parameter to be tuned. You can then watch the mean validation score for the different models as n_rounds increases and define a custom heuristic for stopping early. It won't save the computational time needed to evaluate all the possible n_rounds, though.

I think it is also a better approach than using a single hold-out split for this purpose.
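A minimal sketch of that "tune n_rounds yourself" idea, reusing x_train and y_train from the question (the grid values are arbitrary placeholders, not recommendations):

import pandas as pd
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# Treat the number of boosting rounds as just another grid dimension
param_grid = {
    'learning_rate': [0.01, 0.1],
    'n_estimators': [100, 300, 500, 1000, 2000],
    'max_depth': [3, 5],
}

grid = GridSearchCV(xgb.XGBRegressor(), param_grid,
                    scoring='neg_mean_absolute_error', cv=5)
grid.fit(x_train, y_train)

# Watch how the cross-validated score evolves as n_estimators grows and
# stop adding larger values once it plateaus (the "custom heuristic").
results = pd.DataFrame(grid.cv_results_)
print(results[['param_n_estimators', 'param_learning_rate',
               'param_max_depth', 'mean_test_score']])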

00__00__00
  • @Martijn Pieters when this is accepted I will mark the other as a duplicate. I cannot since this is yet unanswered – 00__00__00 Apr 13 '18 at 05:55
  • @00_00_00 Could you specify what `'eval_set':[(X,y)]` does? I think when we call clf.fit(X_total, y_total), the data `X_total` will be split into train and test sets. If we have `early_stopping_rounds` in `fit_params`, the model will train on the train set and evaluate on the test set. Thus, we do not need to pass an additional `'eval_set':[(X,y)]` for early stopping. Do I make sense? – Travis Dec 07 '19 at 13:42
  • If we do not pass `'eval_set':[(X,y)]` into `fit_params`, the code will raise an error like [this](https://stackoverflow.com/questions/35632634/how-to-pass-a-parameter-to-only-one-part-of-a-pipeline-object-in-scikit-learn). It seems we do a grid search with X_total split into n folds and fit the model out of fold n times, each time fitting with early stopping against `'eval_set':[(X,y)]`. So I think it is not consistent with the CV idea, which is to build the model out of fold and evaluate on each fold. – Travis Dec 07 '19 at 14:55
  • @Travis, I don't really understand your point. The early stopping is always done based on the supplied (X,y) for all CV folds. As long as the accuracy metric is calculated on the hold-out dataset, we will end up getting the correct "mean test" score and hence the best XGB. I don't understand what is inconsistent here. Can you please explain? – Balki Jan 01 '22 at 05:45
  • @Balki I believe Travis misunderstood what the xgboost parameters do, but the point he made about consistency with CV is valid when the same `(X, y)` are used twice: `gridsearch.fit(X, y, fit_params={'eval_set': (X, y)})`. There are several issues here. First, `X_train` and `X` overlap, which will harm early stopping. Second, `X_test` and `X` overlap, which may harm grid search. Neither of these occurs if we do `gridsearch.fit(X_cv, y_cv, fit_params={'eval_set': (X_holdout, y_holdout)})` instead (see the sketch after this comment). This seems the right way, although I am not sure if this is what 00_00_00 meant in their answer. – paperskilltrees May 17 '23 at 21:10
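If I read the last comment correctly, a hedged sketch of that split might look like the following. It reuses x_train and y_train from the question, the grid values are placeholders, and it assumes an older xgboost where early_stopping_rounds is still accepted by fit() (recent releases expect it in the XGBRegressor constructor instead):

import xgboost as xgb
from sklearn.model_selection import GridSearchCV, train_test_split

# Reserve a hold-out set that only drives early stopping; the grid search
# cross-validates on the remaining data, so the two never overlap.
x_cv, x_holdout, y_cv, y_holdout = train_test_split(
    x_train, y_train, test_size=0.2, random_state=42)

param_grid = {'max_depth': [3, 5], 'learning_rate': [0.01, 0.1]}

fit_params = {
    'early_stopping_rounds': 30,           # fit() keyword in older xgboost
    'eval_metric': 'mae',
    'eval_set': [(x_holdout, y_holdout)],  # never seen by the CV folds
    'verbose': False,
}

grid = GridSearchCV(xgb.XGBRegressor(n_estimators=2000), param_grid, cv=5)
grid.fit(x_cv, y_cv, **fit_params)         # recent sklearn: fit params go to fit()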

Use the native xgboost API instead.

Make a DMatrix from your data and use xgboost.cv, which applies early stopping to the mean metric across the CV folds.

tutorial
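A rough sketch of that route, again reusing x_train and y_train from the question (the parameter values are placeholders, not recommendations):

import xgboost as xgb

dtrain = xgb.DMatrix(x_train, label=y_train)

params = {'max_depth': 5, 'eta': 0.1, 'subsample': 0.7}

cv_results = xgb.cv(params, dtrain,
                    num_boost_round=2000,
                    nfold=5,
                    metrics='mae',
                    early_stopping_rounds=30,
                    seed=42)

# xgb.cv stops adding rounds once the mean test MAE across folds stops
# improving, so the length of the result frame is the suggested tree count.
print(len(cv_results), cv_results['test-mae-mean'].min())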

  • This is the only way that works as intended. Ironically, this answer got a -1. Tells a lot about the users. – Michael M Jul 17 '23 at 10:51

It does not make much sense to include early stopping in GridSearchCV. Early stopping is used to quickly find the best n_rounds in a train/validation setting. If we do not care about 'quickly', we can simply tune n_rounds directly. Assuming GridSearchCV had the functionality to do early stopping within each fold, we would end up with N (number of folds) values of n_rounds for each set of hyperparameters. The average n_rounds could be used for the final best hyperparameter set, but that may not be a good choice when the per-fold values of n_rounds differ too much from each other. So including early stopping in GridSearchCV might speed up training, but the result might not be the best.
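To make the point about N different n_rounds concrete, here is a hedged sketch (reusing x_train and y_train from the question, and assuming an older xgboost where early_stopping_rounds is a fit() argument) of per-fold early stopping:

import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold

best_iters = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in kf.split(x_train):
    # Early stopping inside each fold picks its own best iteration
    model = xgb.XGBRegressor(n_estimators=2000, learning_rate=0.1)
    model.fit(x_train.iloc[train_idx], y_train.iloc[train_idx],
              eval_set=[(x_train.iloc[valid_idx], y_train.iloc[valid_idx])],
              eval_metric='mae',
              early_stopping_rounds=30,
              verbose=False)
    best_iters.append(model.best_iteration)

print(best_iters)                # usually differs from fold to fold
print(int(np.mean(best_iters)))  # one possible compromise for a final model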

The suggested method in the accepted answer is more like tuning the n_rounds parameter than early stopping, as the author acknowledges that "it won't save the computational time needed to evaluate all the possible n_rounds".

Ben2018