I am trying to fit a model on a dataset with both categorical and numerical variables. I one-hot encode the categorical features inside a pipeline that I pass to GridSearchCV. The error is raised on the last line, when I try to fit the model. My understanding is that the data does not go through the pipeline's preprocessing step before the model is fitted, because the TypeError refers to the column names BEFORE encoding. What is the correct process?
The error:
TypeError: '['First' 'Second' 'Third']' is an invalid key
My code:
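from sklearn import preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, RepeatedKFold, train_test_split
from sklearn.pipeline import Pipeline

y = sample.iloc[:, -1:]
X = sample.iloc[:, :-1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.90, random_state=2, shuffle=True
)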
categorical_columns = [
    "first",
    "second",
    "third",
]
numerical_columns = [
    "fourth",
    "fifth",
    "sixth",
]
categorical_encoder = preprocessing.OneHotEncoder()
preprocessing = ColumnTransformer([
    ('cat', categorical_encoder, enc_sample[categorical_columns].values.reshape(-1, 3)),
    ('num', 'passthrough', enc_sample[numerical_columns]),
])
pipe = Pipeline([
    ('preprocess', preprocessing),
    ('classifier', GradientBoostingRegressor()),
])
cv = RepeatedKFold(n_splits=2, n_repeats=2, random_state=3)
search_grid = {
    "classifier__n_estimators": [100],
    "classifier__learning_rate": [0.1],
    "classifier__max_depth": [5],
    "classifier__min_samples_leaf": [8],
    "classifier__subsample": [0.6],
}
search = GridSearchCV(
    estimator=pipe,
    param_grid=search_grid,
    scoring='neg_mean_absolute_error',
    cv=cv,
    n_jobs=-1,
    return_train_score=True,
)
search.fit(X_train, y_train)
For reference, I followed the official documentation: https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html
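A possible direction, offered as a sketch rather than a confirmed diagnosis: the third element of each ColumnTransformer tuple is a column selector (for example a list of column names), not the data itself. Passing arrays of values there would plausibly make pandas try to use those values as keys, which matches the "invalid key" message. The version below passes the column name lists instead. The synthetic `sample` DataFrame, the `target` column, the `preprocessor` name (chosen to avoid shadowing the `preprocessing` module), and the `handle_unknown="ignore"` option are assumptions made only so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, RepeatedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for `sample`: three categorical features, three
# numerical features, and the target in the last column.
rng = np.random.default_rng(0)
n_rows = 200
sample = pd.DataFrame({
    "first": rng.choice(["a", "b", "c"], size=n_rows),
    "second": rng.choice(["x", "y"], size=n_rows),
    "third": rng.choice(["p", "q", "r"], size=n_rows),
    "fourth": rng.normal(size=n_rows),
    "fifth": rng.normal(size=n_rows),
    "sixth": rng.normal(size=n_rows),
})
sample["target"] = 2.0 * sample["fourth"] + rng.normal(scale=0.1, size=n_rows)

X = sample.iloc[:, :-1]
y = sample.iloc[:, -1]  # 1-D target, the shape regressors expect

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.90, random_state=2, shuffle=True
)

categorical_columns = ["first", "second", "third"]
numerical_columns = ["fourth", "fifth", "sixth"]

# Each tuple is (name, transformer, column selector). The selector is a list
# of column names here; the transformer picks those columns out of whatever
# DataFrame it receives at fit/predict time.
preprocessor = ColumnTransformer([
    # handle_unknown="ignore" is an optional assumption: it avoids errors when
    # a CV fold contains a category that was absent from the training fold.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_columns),
    ("num", "passthrough", numerical_columns),
])

pipe = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", GradientBoostingRegressor(random_state=0)),
])

cv = RepeatedKFold(n_splits=2, n_repeats=2, random_state=3)
search_grid = {
    "classifier__n_estimators": [100],
    "classifier__learning_rate": [0.1],
    "classifier__max_depth": [5],
    "classifier__min_samples_leaf": [8],
    "classifier__subsample": [0.6],
}
search = GridSearchCV(
    estimator=pipe,
    param_grid=search_grid,
    scoring="neg_mean_absolute_error",
    cv=cv,
    return_train_score=True,
)

# The raw DataFrame goes in; encoding happens inside the pipeline per fold.
search.fit(X_train, y_train)
predictions = search.predict(X_test)
```

With this layout the encoder is refitted on the training part of every fold, so nothing needs to be encoded by hand (no separate `enc_sample`) before calling `search.fit`.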