I want to combine an XGBoost model with input scaling and feature-space reduction by PCA. In addition, the hyperparameters of the model, as well as the number of components used in the PCA, should be tuned using cross-validation. To prevent the model from overfitting, early stopping should be added.

To combine the various steps, I decided to use sklearn's Pipeline functionality.

At first, I had some problems making sure that the PCA is also applied to the validation set, but I think using XGB__eval_set does the trick.

The code runs without any errors but seems to run forever (at some point the CPU usage of all cores drops to zero, but the processes keep running for hours; I eventually had to kill the session).

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor   

# Train / Test split
X_train, X_test, y_train, y_test = train_test_split(X_with_features, y, test_size=0.2, random_state=123)

# Train / Validation split
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=123)

# Pipeline
pipe = Pipeline(steps=[("Scale", StandardScaler()),
                       ("PCA", PCA()),
                       ("XGB", XGBRegressor())])

# Hyper-parameter grid (Test only)
grid_param_pipe = {'PCA__n_components': [5],
                   'XGB__n_estimators': [1000],
                   'XGB__max_depth': [3],
                   'XGB__reg_alpha': [0.1],
                   'XGB__reg_lambda': [0.1]}

# Grid object
grid_search_pipe = GridSearchCV(estimator=pipe,
                                param_grid=grid_param_pipe,
                                scoring="neg_mean_squared_error",
                                cv=5,
                                n_jobs=5,
                                verbose=3)

# Run CV
grid_search_pipe.fit(X_train, y_train, XGB__early_stopping_rounds=10, XGB__eval_metric="rmse", XGB__eval_set=[[X_val, y_val]])
winwin
  • It seems to be non-trivial to apply pipeline transforms to the validation set for early stopping, and I doubt `XGB__eval_set` alone is enough. See this sklearn issue https://github.com/scikit-learn/scikit-learn/issues/8414 for a proposed way of applying a subset of the pipeline steps – Mischa Lisovyi Jun 12 '18 at 20:32
  • It's not difficult to `pop` off the last pipeline step (the classifier), call `transform` on your data, then re-append the classifier. The challenge is doing it with CV while your early stopping set is not your validation set. This will probably require a custom `GridSearchCV` – Bert Kellerman Jun 13 '18 at 14:17
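
The manual route described in the last comment can be sketched as follows. This is only an illustration under some assumptions: it uses sklearn's `GradientBoostingRegressor` as a stand-in for XGBoost and the diabetes toy dataset, and it picks the number of trees after the fact from the validation error, which is a simpler substitute for true early stopping. With XGBoost, the transformed validation set `X_val_t` is what would be handed over as the evaluation set. It does not address the cross-validation part of the problem.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=123)

pipe = Pipeline([
    ("Scale", StandardScaler()),
    ("PCA", PCA(n_components=5)),
    # Stand-in for XGBRegressor (assumption made for this sketch).
    ("GBR", GradientBoostingRegressor(n_estimators=200, random_state=0)),
])

# Slicing a Pipeline returns a pipeline over the same step objects,
# so fitting the slice fits the preprocessing steps of `pipe`.
preprocess = pipe[:-1]
X_train_t = preprocess.fit_transform(X_train, y_train)
# Reuse the transforms fitted on the training data for the validation data.
X_val_t = preprocess.transform(X_val)

# Fit the final step on the transformed training data.
model = pipe[-1]
model.fit(X_train_t, y_train)

# Validation error after each boosting stage; keep the best stage.
val_mse = [mean_squared_error(y_val, pred)
           for pred in model.staged_predict(X_val_t)]
best_n_estimators = int(np.argmin(val_mse)) + 1
print(X_val_t.shape, best_n_estimators)
```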

1 Answer

The problem is that the `fit` method requires an evaluation set created externally, but we cannot create one before the data has been transformed by the pipeline.
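
To make this concrete, here is a small sketch showing that a fit parameter passed as `XGB__eval_set` reaches the last step unchanged, while the training data does go through the PCA. `ShapeRecorder` is a hypothetical stand-in for `XGBRegressor` that only records the number of columns it receives, and the random data is made up for the illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline


class ShapeRecorder(BaseEstimator):
    """Hypothetical final step: records the shapes it is given, nothing else."""

    def fit(self, X, y, eval_set=None):
        # Number of features in the data the pipeline hands to this step.
        self.n_features_fit_ = X.shape[1]
        # Number of features in the externally supplied evaluation set.
        self.n_features_eval_ = eval_set[0][0].shape[1]
        return self


rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(40, 6)), rng.normal(size=40)
X_val, y_val = rng.normal(size=(10, 6)), rng.normal(size=10)

pipe = Pipeline([("PCA", PCA(n_components=2)), ("XGB", ShapeRecorder())])
pipe.fit(X_train, y_train, XGB__eval_set=[(X_val, y_val)])

recorder = pipe.named_steps["XGB"]
# Training data was reduced to 2 components; the eval set still has 6 columns.
print(recorder.n_features_fit_, recorder.n_features_eval_)  # 2 6
```

With a real model, this mismatch means the evaluation set is scored in a different feature space than the one the model was trained in.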

A somewhat hacky workaround is to create a thin wrapper around the xgboost regressor/classifier that prepares the evaluation set internally.

from sklearn.base import BaseEstimator
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor, XGBClassifier

class XGBoostWithEarlyStop(BaseEstimator):
    def __init__(self, early_stopping_rounds=5, test_size=0.1, 
                 eval_metric='mae', **estimator_params):
        self.early_stopping_rounds = early_stopping_rounds
        self.test_size = test_size
        # Store the metric that was passed in (previously it was always overwritten with 'mae').
        self.eval_metric = eval_metric
        if self.estimator is not None:
            self.set_params(**estimator_params)

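    def set_params(self, **params):
        # Forward the parameters to the wrapped XGBoost model, but return the
        # wrapper itself so that GridSearchCV keeps working with this object.
        self.estimator.set_params(**params)
        return self

    def get_params(self, deep=True):
        # Expose the wrapped model's parameters (e.g. n_estimators) for tuning.
        return self.estimator.get_params(deep=deep)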

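    def fit(self, X, y):
        # Hold out part of the (already transformed) data for early stopping.
        x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=self.test_size)
        # Note: passing early_stopping_rounds/eval_metric to fit() is the older
        # xgboost API; newer releases expect them in the estimator's constructor.
        self.estimator.fit(x_train, y_train, 
                           early_stopping_rounds=self.early_stopping_rounds, 
                           eval_metric=self.eval_metric, eval_set=[(x_val, y_val)])
        return self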

    def predict(self, X):
        return self.estimator.predict(X)

class XGBoostRegressorWithEarlyStop(XGBoostWithEarlyStop):
    def __init__(self, *args, **kwargs):
        self.estimator = XGBRegressor()
        super(XGBoostRegressorWithEarlyStop, self).__init__(*args, **kwargs)

class XGBoostClassifierWithEarlyStop(XGBoostWithEarlyStop):
    def __init__(self, *args, **kwargs):
        self.estimator = XGBClassifier()
        super(XGBoostClassifierWithEarlyStop, self).__init__(*args, **kwargs)

Below is a test.

from sklearn.datasets import load_diabetes
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV

x, y = load_diabetes(return_X_y=True)
print(x.shape, y.shape)
# (442, 10) (442,)

pipe = Pipeline([
    ('pca', PCA(5)),
    ('xgb', XGBoostRegressorWithEarlyStop())
])

param_grid = {
    'pca__n_components': [3, 5, 7],
    'xgb__n_estimators': [10, 20, 30, 50]
}

grid = GridSearchCV(pipe, param_grid, scoring='neg_mean_absolute_error')
grid.fit(x, y)
print(grid.best_params_)

If one were to file a feature request with the developers, the easiest extension would be to allow XGBRegressor to create an evaluation set internally when none is provided. This way, no extension to scikit-learn would be necessary (I guess).

Kota Mori
  • What if one wants to use lasso, random forest, or in general another predictive model in the same pipeline? – Areza Dec 17 '18 at 09:50
  • @Kota Mori, I am not quite sure that the class you have introduced works correctly for cross-validation with early stopping. Suppose you want to do 5-fold CV. In the first round, we take the first 4 folds as training data, fit the model on them, and evaluate/validate the model on the last fold to get the score. With your class, you do another train-test split (test_size=0.1 in the fit function of your class) and basically evaluate the model on 10% of the first 4 folds instead of validating it on the last fold. Right? Please correct me if I misunderstood something here. – Amin Kiany Apr 24 '19 at 12:34
  • Great answer! However, there is 1 bug. `set_params` should set `self.estimator=self.estimator.set_params(...)` and return `self` instead. This way, after setting parameter in gridsearchcv, the estimator object will still be this wrapper object rather than the raw `XGBClassifier`. – Tim Sep 20 '22 at 19:20
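
The data flow discussed in the comments above can be traced with indices alone. The following sketch assumes 5-fold CV on 442 samples (the size of the diabetes data in the test) and the wrapper's default `test_size=0.1`; the names `fit_idx`, `stop_idx` and `score_idx` are made up for the illustration. It shows three disjoint parts per round: rows used for boosting, rows used only for early stopping, and the held-out fold that `GridSearchCV` uses for the score.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

n_samples = 442  # number of rows in the diabetes dataset
indices = np.arange(n_samples)

# Outer split: what GridSearchCV does in the first CV round.
outer = KFold(n_splits=5)
train_idx, score_idx = next(iter(outer.split(indices)))

# Inner split: what the wrapper's fit() does with the rows it receives.
fit_idx, stop_idx = train_test_split(train_idx, test_size=0.1, random_state=0)

# Rows for boosting, rows for early stopping, rows for the CV score.
print(len(fit_idx), len(stop_idx), len(score_idx))
```

Under these assumptions, the early-stopping rows come out of the CV training folds, and the scoring fold is never seen during fitting.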