I've only been using sklearn for a short time and I'm currently trying to build a pipeline together with GridSearchCV. As far as I understand, a pipeline is particularly useful when you want to chain a known sequence of transformations with a final estimator and keep the associated cross-validation consistent. However, I would also like to use the pipeline to combine an arbitrary set of preprocessing steps with a set of estimators. Assume, for example, that I have one preprocessing object (e.g. StandardScaler()) and two estimator objects (e.g. Lasso() and PLSRegression()). To cover the possible combinations of these objects (here: (StandardScaler() AND Lasso()) OR (StandardScaler() AND PLSRegression())), I construct the pipeline and the associated GridSearchCV object as follows:
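from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Instantiate the pipeline with placeholder steps; the grid search
# fills them in from the parameter grid below
pipe = Pipeline(steps=[('preprocessor', None), ('estimator', None)])

# Construct the parameter grid accounting for different combinations
# of preprocessing and estimation steps (one dict per combination)
params = [
    {
        'estimator': [Lasso()],
        'estimator__alpha': [0.5, 1],
        'preprocessor': [StandardScaler()],
        'preprocessor__with_mean': [True, False],
    },
    {
        'estimator': [PLSRegression()],
        'estimator__n_components': [1, 2],
        'preprocessor': [StandardScaler()],
        'preprocessor__with_mean': [True, False],
    },
]

# Instantiate GridSearchCV object
gs = GridSearchCV(estimator=pipe, param_grid=params)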
I found this solution in an earlier post, '"Parallel" pipeline to get best model using gridsearch'. However, I find it somewhat cumbersome: either I have to define every feasible combination of pipeline steps by hand, or I have to write a hacky function that generates them for me. Especially with many more preprocessing steps and a larger variety of estimators, this is not what I want to do. Is there a different, simpler way to evaluate different combinations of preprocessors and estimators, preferably within the Pipeline/GridSearchCV construction itself?
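For concreteness, the kind of helper function I would like to avoid writing could be sketched roughly as follows. The name build_param_grid and the layout of step_options are my own invention and not part of sklearn; the strings are only stand-ins for estimator instances so that the sketch runs without any dependencies.

```python
from itertools import product


def build_param_grid(step_options):
    """Build a GridSearchCV-style list of parameter dicts.

    step_options maps each pipeline step name to a list of
    (candidate, candidate_params) pairs, where candidate_params maps the
    candidate's own parameter names to the list of values to try.
    (Hypothetical helper, not an sklearn function.)
    """
    step_names = list(step_options)
    grids = []
    # One sub-grid per combination, picking one candidate for every step.
    for combo in product(*(step_options[name] for name in step_names)):
        grid = {}
        for name, (candidate, candidate_params) in zip(step_names, combo):
            # The step itself is a single-element list, as in the manual grid.
            grid[name] = [candidate]
            # Prefix each parameter with the step name: "<step>__<param>".
            for param, values in candidate_params.items():
                grid[f"{name}__{param}"] = list(values)
        grids.append(grid)
    return grids


# Stand-in strings keep the sketch dependency-free; in practice they would
# be instances such as StandardScaler(), Lasso() and PLSRegression().
step_options = {
    "preprocessor": [("scaler", {"with_mean": [True, False]})],
    "estimator": [
        ("lasso", {"alpha": [0.5, 1]}),
        ("pls", {"n_components": [1, 2]}),
    ],
}
params = build_param_grid(step_options)
```

With real estimator instances in place of the strings, the resulting list should have the same shape as the hand-written params above and could be passed as param_grid. It works, but it is exactly the extra layer I was hoping the Pipeline/GridSearchCV machinery would make unnecessary.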