Efficiently combining sklearn's Pipeline and GridSearchCV

Question

I've only been using sklearn for a short time, and I'm currently trying to build a Pipeline together with GridSearchCV. As far as I understand, a pipeline is particularly useful when you want to chain a known sequence of transformations with a final estimator and keep the cross-validation consistent across all steps. However, I would also like to use the pipeline to combine an arbitrary set of preprocessing steps with a set of estimators. Assume, for example, that I have one preprocessing object (e.g. StandardScaler()) and two estimator objects (e.g. Lasso() and PLSRegression()). To cover the possible combinations of these objects, here (StandardScaler() AND Lasso()) OR (StandardScaler() AND PLSRegression()), I construct the pipeline and the associated GridSearchCV object as follows:

from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Instantiate the pipeline with placeholder steps; the actual objects
# are filled in by the parameter grid below
pipe = Pipeline(steps=[('preprocessor', None), ('estimator', None)])

# Construct the parameter grid: one dict per combination of
# preprocessing step and estimator
params = [
    {
        'estimator': [Lasso()],
        'estimator__alpha': [0.5, 1],
        'preprocessor': [StandardScaler()],
        'preprocessor__with_mean': [True, False],
    },
    {
        'estimator': [PLSRegression()],
        'estimator__n_components': [1, 2],
        'preprocessor': [StandardScaler()],
        'preprocessor__with_mean': [True, False],
    },
]

# Instantiate the GridSearchCV object
gs = GridSearchCV(estimator=pipe, param_grid=params)

I found this solution in an earlier post, '"Parallel" pipeline to get best model using gridsearch'. However, I find it somewhat cumbersome: either I have to define every feasible combination of pipeline steps by hand, or I have to write a hacky function that does the job for me. With many more preprocessing steps and a larger variety of estimators, neither is what I want to do. Is there a different, simpler way to evaluate different combinations of preprocessors and estimators, preferably within the Pipeline/GridSearchCV construction itself?

What I've done before is create a wrapper class for each transformer/estimator you might use, then pass an `enabled` parameter to it during the grid search. In the wrapper, override the `fit()` and `transform()` calls to check `self.enabled`. If the wrapper is not `enabled`, these methods just `return self` or `return X` respectively. If it is `enabled`, they call the wrapped transformer or estimator. You'll also have to implement `set_params()` for `GridSearchCV` to use. - Bert Kellerman, Jun 28 '18 at 17:55
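A minimal sketch of what such a wrapper might look like for transformers is shown here. The class name `OptionalTransformer` is hypothetical, and the sketch assumes the wrapper inherits from sklearn's `BaseEstimator`, which supplies `get_params()` and `set_params()` (including nested names such as `transformer__with_mean`), so they do not need to be written by hand.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.preprocessing import StandardScaler


class OptionalTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper that switches a transformer on or off via `enabled`."""

    def __init__(self, transformer=None, enabled=True):
        self.transformer = transformer
        self.enabled = enabled

    def _active(self):
        return self.enabled and self.transformer is not None

    def fit(self, X, y=None):
        if self._active():
            # Fit a clone so the object passed to __init__ is left untouched.
            self.transformer_ = clone(self.transformer).fit(X, y)
        return self

    def transform(self, X):
        if self._active():
            return self.transformer_.transform(X)
        return X  # disabled: pass the data through unchanged


# Example grid entries for a pipeline step named 'preprocessor' (assumed name).
wrapper = OptionalTransformer(StandardScaler())
param_grid = {
    'preprocessor__enabled': [True, False],
    'preprocessor__transformer__with_mean': [True, False],
}
```

One consequence of this design is that when `enabled` is `False` the search still iterates over the wrapped transformer's parameters, so some fits are redundant. For simply skipping a step, setting it to `'passthrough'` in the parameter grid is an alternative that needs no wrapper.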

