This issue has been discussed here but there has been no comments: https://github.com/scikit-learn/scikit-learn/issues/16473
I have some numerical features and categorical features in X. The categorical features were one hot encoded. So my pipeline is something similar to the sklearn docs example:
cat_proc_lin = make_pipeline(
SimpleImputer(missing_values=None,
strategy='constant',
fill_value='missing'),
OneHotEncoder(categories=categories)
)
num_proc_lin = make_pipeline(
SimpleImputer(strategy='mean'),
StandardScaler()
)
processor_lin = make_column_transformer(
(cat_proc_lin, cat_cols),
(num_proc_lin, num_cols),
remainder='passthrough')
lasso_pipeline = make_pipeline(processor_lin,
LassoCV())
rf_pipeline = make_pipeline(processor_nlin,
RandomForestRegressor(random_state=42))
gradient_pipeline = make_pipeline(
processor_nlin,
HistGradientBoostingRegressor(random_state=0))
estimators = [('Random Forest', rf_pipeline),
('Lasso', lasso_pipeline),
('Gradient Boosting', gradient_pipeline)]
stacking_regressor = StackingRegressor(estimators=estimators,
final_estimator=RidgeCV())
But if I change passthrough=True, it will raise a TypeError because the passthrough gives the original X and skips the preprocessing part of the pipeline:
/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
ValueError: could not convert string to float: 'RL'
Is there anyway to make the passthrough include the first preprocessing part of the pipeline?
I also cannot add the preprocessing pipeline infront of the final estimator because it will concatenate the original X dataframe with the final layer predictions which is a numpy array as mentioned in the github discussion link at the top of this post. My exact preprocessing pipeline has several custom transformers that operates on pandas dataframe.
Thank you for any help!