scikit pipeline is not proceeded correctly with GridsearchCV

Question

I am trying to fit a model on a dataset with categorical and numerical variables. I one-hot encode the categorical features and pass the result into a pipeline used in GridSearchCV. The error is raised on the last line, when I try to fit the model. My understanding is that the data does not go through the pipeline before the model is fitted, since I get a type error on the column names from BEFORE encoding. What is the correct process?

The error:

TypeError: '['First' 'Second' 'Third']' is an invalid key

My code:

y = sample.iloc[:, -1:]
X = sample.iloc[:, :-1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.90, random_state=2, shuffle=True
)

categorical_columns = [
    "first",
    "second",
    "third"]
numerical_columns = [
    "fourth",
    "thith", 
    "sixth"
]
categorical_encoder = preprocessing.OneHotEncoder()

preprocessing = ColumnTransformer(
    [('cat', categorical_encoder, enc_sample[categorical_columns].values.reshape(-1, 3)),
     ('num', 'passthrough', enc_sample[numerical_columns])])

pipe = Pipeline([
    ('preprocess', preprocessing),
    ('classifier', GradientBoostingRegressor())
])

cv = RepeatedKFold(n_splits=2, n_repeats=2, random_state=3)

search_grid = {
    "classifier__n_estimators": [100],
    "classifier__learning_rate": [0.1],
    "classifier__max_depth": [5],
    "classifier__min_samples_leaf":[8],
    "classifier__subsample":[0.6]
}
search = GridSearchCV(
    estimator=pipe, param_grid=search_grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, return_train_score=True
)
search.fit(X_train, y_train)

For reference, I followed the official documentation: https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html

score 1 · Accepted Answer · answered Jul 01 '21 at 12:48

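It looks like your column transformer is not selecting the categorical and numerical columns. The third element of each ColumnTransformer tuple has to say which columns to use (for example column names, indices, or a callable), but your code passes the data itself, which is most likely why the values end up being treated as a key. You can fix that by using sklearn.compose.make_column_selector to select columns based on their dtypes.

You can use it as follows: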

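from sklearn.compose import make_column_selector

# Each tuple is (name, transformer, column selector); the selector picks
# columns by dtype when the transformer is fitted on a DataFrame.
preprocessing = ColumnTransformer(
    [('cat', categorical_encoder, make_column_selector(dtype_include=object)),
     ('num', 'passthrough', make_column_selector(dtype_exclude=object))])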
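
To see the fix end to end, here is a minimal, self-contained sketch. It runs on a small synthetic DataFrame (the data, the reduced column lists, the `by_name`/`by_dtype` names and the tiny parameter grid are all assumptions made for illustration, not the asker's real setup). It shows two ways to fill the third slot of each tuple: plain lists of column names, and `make_column_selector`. The selectors here key on `np.number` instead of `object`, which is just one possible choice for separating numeric from non-numeric columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the asker's data (hypothetical values).
rng = np.random.default_rng(0)
n = 80
sample = pd.DataFrame({
    "first": rng.choice(["a", "b"], size=n),
    "second": rng.choice(["x", "y", "z"], size=n),
    "fourth": rng.normal(size=n),
    "sixth": rng.normal(size=n),
})
sample["target"] = (
    2 * sample["fourth"]
    + (sample["first"] == "a")
    + rng.normal(scale=0.1, size=n)
)

X = sample.drop(columns="target")
y = sample["target"]  # 1-D target

categorical_columns = ["first", "second"]
numerical_columns = ["fourth", "sixth"]

# Option 1: pass column NAMES (not the data) as the third tuple element.
by_name = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_columns),
    ("num", "passthrough", numerical_columns),
])

# Option 2: let a selector pick the columns by dtype at fit time.
by_dtype = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_exclude=np.number)),
    ("num", "passthrough", make_column_selector(dtype_include=np.number)),
])

pipe = Pipeline([
    ("preprocess", by_name),
    ("regressor", GradientBoostingRegressor(random_state=0)),
])

search = GridSearchCV(
    estimator=pipe,
    param_grid={"regressor__n_estimators": [50], "regressor__max_depth": [3]},
    scoring="neg_mean_absolute_error",
    cv=RepeatedKFold(n_splits=2, n_repeats=2, random_state=3),
)
search.fit(X, y)  # the raw DataFrame goes in; encoding happens inside each fold

print(search.best_params_)
print(by_name.fit_transform(X).shape, by_dtype.fit_transform(X).shape)
```

Because the encoder lives inside the pipeline, GridSearchCV refits it on the training part of every split, so there is no need to encode the data beforehand. `handle_unknown="ignore"` is used here so that a category missing from one training fold does not raise an error at prediction time.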
