XGBoost: cannot pass validation data for eval_set in pipeline

Question

I want to run GridSearchCV on an XGBoost model inside a pipeline. I have a data preprocessor and a parameter grid (both defined in the full code below), and this pipeline:

XGBmodel = XGBRegressor(random_state=0)
pipe = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('XGBmodel', XGBmodel)
])

I want to pass these fit parameters:

fit_params = {"XGBmodel__eval_set": [(X_valid, y_valid)], 
              "XGBmodel__early_stopping_rounds": 10, 
              "XGBmodel__verbose": False}

I then try to fit the model:

searchCV = GridSearchCV(pipe, cv=5, param_grid=param_grid, fit_params=fit_params)
searchCV.fit(X_train, y_train)

but I get an error on the line with eval_set: DataFrame.dtypes for data must be int, float or bool

I guess this is because the validation data does not go through the preprocessing step. However, every example I find online does it this way, so it seems it should work. I also tried to apply the preprocessor to the validation data separately, but the preprocessor cannot transform the validation data until it has been fitted on the training data.

Full code

columns = num_cols + cat_cols
X_train = X_full_train[columns].copy()
X_valid = X_full_valid[columns].copy()

num_preprocessor = SimpleImputer(strategy = 'mean')
cat_preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy = 'most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', num_preprocessor, num_cols),
    ('cat', cat_preprocessor, cat_cols)
])

XGBmodel = XGBRegressor(random_state=0)
pipe = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('XGBmodel', XGBmodel)
])

param_grid = {
    "XGBmodel__n_estimators": [10, 50, 100, 500],
    "XGBmodel__learning_rate": [0.1, 0.5, 1],
}

fit_params = {"XGBmodel__eval_set": [(X_valid, y_valid)], 
              "XGBmodel__early_stopping_rounds": 10, 
              "XGBmodel__verbose": False}

searchCV = GridSearchCV(pipe, cv=5, param_grid=param_grid, fit_params=fit_params)
searchCV.fit(X_train, y_train)

Is there any way to preprocess the validation data inside the pipeline? Or is there a completely different way to implement this?

score 5 · Answer 1 · answered May 30 '19 at 08:07

There is no good way. If you have a long chain of transformers before the model, consider fitting the transformers in a pipeline and then fitting the model separately on their output.

The underlying issue is that a pipeline has no notion of a validation set used during model fitting, so whatever you pass as eval_set reaches the model untransformed. You can see a discussion on LightGBM github here. Their proposal is to pre-fit the transformers and apply them to the validation data before you fit the full pipeline. This is fine if your transformers are fast, but because the transformers are fitted twice it can double CPU time in an extreme scenario.

Antoine Dubuis · Answer 2 · 2021-12-01T09:28:15.640

One way to train a pipeline that uses early stopping is to fit the preprocessing steps and the regressor separately.

The steps are the following:

1. fit_transform() the transformers on the training data
2. transform() the validation data with the fitted transformers
3. fit() the model with the XGBoost fit parameters (eval_set, early stopping)
4. dump the fitted pipeline

as follows:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor
import numpy as np
import joblib

rng = np.random.RandomState(0)
X_train, X_val = rng.randn(50, 3), rng.randn(20, 3)
y_train, y_val = rng.randn(50, 1), rng.randn(20, 1)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', XGBRegressor(random_state=0)),
])

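# Fit every step except the last (the transformers) on the training data,
# then apply the fitted transformers to the validation data.
X_train_transformed = pipeline[:-1].fit_transform(X_train)
X_val_transformed = pipeline[:-1].transform(X_val)

# Fit only the final step (the regressor) on the transformed data.
# Note: xgboost >= 2.0 no longer accepts early_stopping_rounds in fit();
# pass it to the XGBRegressor constructor instead.
pipeline[-1].fit(
    X=X_train_transformed,
    y=y_train,
    eval_set=[(X_val_transformed, y_val)],
    early_stopping_rounds=10,
)

# All steps are now fitted, so the pipeline can be saved and reused as a whole.
joblib.dump(pipeline, 'pipeline.pkl')
pipe = joblib.load('pipeline.pkl')
pipe.score(X_val, y_val)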

Note: This works if you only want to fit the pipeline. However, if you want to run a grid search with early stopping, you will have to write your own grid search, like in this article.
