I am using Pipeline and GridSearchCV from the scikit-learn library.

I know that feature selection methods, for example, can be combined with FeatureUnion. In that case the results are concatenated. What I am looking for is an OR functionality, such that the grid search tries each alternative separately and doesn't combine them at the end.

In the (invalid) example below, SelectKBest() + SVC() and VarianceThreshold() + SVC() should each be executed.

pipeline = Pipeline([ 
    [('kbest', SelectKBest()), 
     ('variance', VarianceThreshold())], 
    ('svm', SVC()) 
]) 

parameters = { 
    'kbest__k': [3, 5], 
    'variance__threshold': [0.1, 0.2], 
    'svm__C': [1], 
    'svm__gamma': [0.1, 0.01] 
} 

grid_search = GridSearchCV(pipeline, parameters) 
grid_search.fit(X, y) 

If this is possible, can the same functionality also be used to try multiple estimators?

M. Kruber
  • I don't know the answer, but a workaround is using a list of pipelines, `pipelines = [Pipeline([cleaner, ('svm', SVC())]) for cleaner in [('kbest', SelectKBest()), ('variance', VarianceThreshold())]]`. – Mephy Aug 05 '16 at 12:43
  • @Mephy thanks for your suggestion. The disadvantage in this case would be that you have to define a separate set of parameters in each iteration. – M. Kruber Aug 05 '16 at 12:58
  • Have a look at my answer on this question: http://stackoverflow.com/questions/23045318/scikit-grid-search-over-multiple-classifiers-python/34003326#34003326 – Stergios Aug 08 '16 at 07:38

1 Answer

Here's how to do it:

  1. Pass a list of dicts as the parameter grid instead of a single dict, similar to the example in the official scikit-learn documentation. GridSearchCV searches each dict separately, so the dicts act like alternatives joined by OR.
  2. To skip a step within one of those dicts, set that step's name to [None].
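
To make the OR behaviour concrete, here is a small sketch that expands a list of dicts with `ParameterGrid`, the helper from `sklearn.model_selection` that enumerates a grid. The grid is a trimmed-down version of the one used in this answer; the variable names `param_grid`, `candidates` and `n_candidates` are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Each dict is expanded on its own; the results are simply chained together.
param_grid = [
    # Branch 1: skip 'variance', tune SelectKBest
    {'variance': [None], 'kbest__k': [1, 2], 'svm__gamma': [0.1, 0.01]},
    # Branch 2: skip 'kbest', tune VarianceThreshold
    {'kbest': [None], 'variance__threshold': [0.1, 0.2], 'svm__gamma': [0.1, 0.01]},
]

candidates = list(ParameterGrid(param_grid))
n_candidates = len(candidates)  # 4 from the first dict + 4 from the second

for params in candidates:
    print(params)
```

No candidate mixes `kbest__k` with `variance__threshold`, which is the difference from putting everything into a single dict. Newer scikit-learn versions also accept the string 'passthrough' in place of None for a skipped step.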

A working example:

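from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest, VarianceThreshold
from sklearn.svm import SVC
from sklearn.datasets import load_iris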

pipeline = Pipeline([ 
    ('kbest', SelectKBest()), 
    ('variance', VarianceThreshold()), 
    ('svm', SVC()) 
]) 

iris = load_iris()
X = iris.data
y = iris.target

parameters = [
    {
        'variance': [None], 
        'kbest__k': [1, 2], 
        'svm__C': [1], 
        'svm__gamma': [0.1, 0.01] 
    },
    {
        'kbest': [None], 
        'variance__threshold': [0.1, 0.2], 
        'svm__C': [1], 
        'svm__gamma': [0.1, 0.01] 
    }
]

grid_search = GridSearchCV(pipeline, parameters) 
grid_search.fit(X, y) 
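
As for trying multiple estimators: the same list-of-dicts idea seems to cover that too, because a step itself can be a grid parameter. The following is only a sketch under my own assumptions: the step names `select` and `clf`, the choice of `LogisticRegression` as the second estimator, and `cv=3` are not from the question.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Placeholder steps; each dict in the grid swaps in the objects it wants.
pipe = Pipeline([('select', SelectKBest()), ('clf', SVC())])

param_grid = [
    {   # Alternative 1: SelectKBest + SVC
        'select': [SelectKBest()],
        'select__k': [1, 2],
        'clf': [SVC()],
        'clf__C': [1],
        'clf__gamma': [0.1, 0.01],
    },
    {   # Alternative 2 (assumed for illustration): VarianceThreshold + LogisticRegression
        'select': [VarianceThreshold()],
        'select__threshold': [0.1, 0.2],
        'clf': [LogisticRegression(max_iter=1000)],
        'clf__C': [0.1, 1],
    },
]

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

# Shows which alternative won, including the step objects themselves.
print(search.best_params_)
```

Each dict only mentions the parameters that belong to its own selector and estimator, so there is no need to define separate pipelines per combination.
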
Alex Ramses