Is there an "Or" functionality in scikit-learn's pipelines?

Question

I am using Pipeline and GridSearchCV from the scikit-learn library.

I know that, for example, feature selection methods can be combined with FeatureUnion, in which case their results are concatenated. What I am looking for is an "or" functionality, so that grid search tries each alternative on its own and doesn't combine the results at the end.

In the (invalid) example below, SelectKBest() + SVC() and VarianceThreshold() + SVC() should each be executed.

pipeline = Pipeline([ 
    [('kbest', SelectKBest()), 
     ('variance', VarianceThreshold())], 
    ('svm', SVC()) 
]) 

parameters = { 
    'kbest__k': [3, 5], 
    'variance__threshold': [0.1, 0.2], 
    'svm__C': [1], 
    'svm__gamma': [0.1, 0.01] 
} 

grid_search = GridSearchCV(pipeline, parameters) 
grid_search.fit(X, y)

If so, can the same functionality be used to try multiple estimators?

I don't know the answer, but a workaround is using a list of pipelines, `pipelines = [Pipeline([cleaner, ('svm', SVC())]) for cleaner in [('kbest', SelectKBest()), ('variance', VarianceThreshold())]]`. — Mephy, Aug 05 '16 at 12:43
@Mephy thanks for your suggestion. The disadvantage in this case would be that you have to define a separate set of parameters in each iteration. — M. Kruber, Aug 05 '16 at 12:58
Have a look at my answer on this question: http://stackoverflow.com/questions/23045318/scikit-grid-search-over-multiple-classifiers-python/34003326#34003326 — Stergios, Aug 08 '16 at 07:38

Alex Ramses · Answer 1 · 2019-10-31T03:06:57.730

Here's how to do it:

Use a list of dicts instead of a single dict, similar to the example in the official scikit-learn documentation. Each dict then acts like one branch of an OR statement.
To skip a step within a branch, set it to [None] in that branch's dict.
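To see what skipping a step means outside of a grid search, here is a minimal sketch. It assumes the iris dataset and a fixed k=2, both chosen only for illustration. Setting a step to None through set_params makes the pipeline hand the data to the next step unchanged; recent scikit-learn versions also accept the string 'passthrough' for the same purpose.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('kbest', SelectKBest(k=2)),        # k=2 is an arbitrary choice for this sketch
    ('variance', VarianceThreshold()),
    ('svm', SVC()),
])

# Disable the 'variance' step: data now flows straight from 'kbest' to 'svm'.
pipeline.set_params(variance=None)
pipeline.fit(X, y)

# Number of features the SVC was actually trained on.
n_features_seen = pipeline.named_steps['svm'].n_features_in_
print(n_features_seen)
```

GridSearchCV applies each parameter combination through the same set_params mechanism, which is why a value of None in a grid switches the step off for that candidate.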

A working example:

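from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, VarianceThreshold
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC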

pipeline = Pipeline([ 
    ('kbest', SelectKBest()), 
    ('variance', VarianceThreshold()), 
    ('svm', SVC()) 
]) 

iris = load_iris()
X = iris.data
y = iris.target

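parameters = [
    # Branch 1: skip 'variance', tune SelectKBest + SVC
    {
        'variance': [None],
        'kbest__k': [1, 2],
        'svm__C': [1],
        'svm__gamma': [0.1, 0.01]
    },
    # Branch 2: skip 'kbest', tune VarianceThreshold + SVC
    {
        'kbest': [None],
        'variance__threshold': [0.1, 0.2],
        'svm__C': [1],
        'svm__gamma': [0.1, 0.01]
    }
]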

grid_search = GridSearchCV(pipeline, parameters) 
grid_search.fit(X, y)
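As for trying multiple estimators: a parameter grid can also replace a whole step with a different estimator object, so the same list-of-dicts idea should carry over. The following is a sketch under assumptions, not part of the original answer: the step names 'select' and 'clf' are placeholders invented here, and LogisticRegression is just an arbitrary second classifier.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Placeholder steps; each grid branch below overwrites them.
pipeline = Pipeline([
    ('select', SelectKBest()),
    ('clf', SVC()),
])

parameters = [
    # Branch 1: SelectKBest + SVC
    {
        'select': [SelectKBest()],
        'select__k': [1, 2],
        'clf': [SVC()],
        'clf__C': [1],
        'clf__gamma': [0.1, 0.01],
    },
    # Branch 2: VarianceThreshold + LogisticRegression
    {
        'select': [VarianceThreshold()],
        'select__threshold': [0.1, 0.2],
        'clf': [LogisticRegression(max_iter=1000)],
        'clf__C': [0.1, 1],
    },
]

grid_search = GridSearchCV(pipeline, parameters)
grid_search.fit(X, y)

# Each dict is expanded separately, so the candidates are the sum of both branches.
n_candidates = len(grid_search.cv_results_['params'])
print(n_candidates)
print(grid_search.best_params_)
```

Inspecting grid_search.best_params_ afterwards shows which selector and which classifier the winning candidate used, along with their tuned values.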
