44

Is there a better built-in way to do a grid search and test multiple models in a single pipeline? Of course, the parameters of the models would be different, which made it complicated for me to figure this out. Here is what I did:

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import GridSearchCV


def grid_search():
    pipeline1 = Pipeline([
        ('vec2', TfidfTransformer()),
        ('clf', RandomForestClassifier()),
    ])

    pipeline2 = Pipeline([
        ('clf', KNeighborsClassifier()),
    ])

    pipeline3 = Pipeline([
        ('clf', SVC()),
    ])

    pipeline4 = Pipeline([
        ('clf', MultinomialNB()),
    ])
    
    parameters1 = {
        'clf__n_estimators': [10, 20, 30],
        'clf__criterion': ['gini', 'entropy'],
        'clf__max_features': [5, 10, 15],
        'clf__max_depth': [5, 10, 15, None]
    }

    parameters2 = {
        'clf__n_neighbors': [3, 7, 10],
        'clf__weights': ['uniform', 'distance']
    }

    parameters3 = {
        'clf__C': [0.01, 0.1, 1.0],
        'clf__kernel': ['rbf', 'poly'],
        'clf__gamma': [0.01, 0.1, 1.0],
    }

    parameters4 = {
        'clf__alpha': [0.01, 0.1, 1.0]
    }

    pars = [parameters1, parameters2, parameters3, parameters4]
    pips = [pipeline1, pipeline2, pipeline3, pipeline4]

    print("starting Gridsearch")
    for i in range(len(pars)):
        gs = GridSearchCV(pips[i], pars[i], verbose=2, refit=False, n_jobs=-1)
        gs = gs.fit(X_train, y_train)
        print("finished Gridsearch")
        print(gs.best_score_)

However, this approach still gives the best model within each classifier; it does not compare across classifiers.

Aks
  • There's no automatic way to do this. – Fred Foo Apr 14 '14 at 07:47
  • yet ;) [the problem is that we cannot set the "steps" of the pipeline, right?] – Andreas Mueller Apr 14 '14 at 18:06
  • @AndreasMueller; sorry didn't address this earlier. Can you elaborate what you meant there? – Aks Jan 08 '15 at 12:21
  • Well you cannot switch the Pipeline steps using the parameter grid. – Andreas Mueller Jan 08 '15 at 21:44
  • Has this been changed/updated with this functionality? – Alessandro Oct 14 '16 at 10:34
  • The post [Hyperparameter Grid Search across multiple models in scikit-learn](http://www.davidsbatista.net/blog/2018/02/23/model_optimization/) (by David S. Batista) offers an updated implementation of an `EstimatorSelectionHelper` estimator which can run different estimators, each with its own grid of parameters. – dubek Jan 15 '17 at 11:14
  • This solution worked best for me; I only had to make some small changes to run on Python 3 and the latest scikit-learn 0.19. The code is available here: http://davidsbatista.net/blog/2018/02/23/model_optimization/ – David Batista Feb 24 '18 at 11:47
  • Isn't [this](https://stackoverflow.com/a/51629917/10161091) the answer? – SaTa Aug 18 '20 at 18:08

5 Answers

24

Although the solution from dubek is more straightforward, it does not help with interactions between parameters of pipeline elements that come before the classifier. Therefore, I have written a helper class to deal with it, which can be included in the default Pipeline setting of scikit-learn. A minimal example:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler, MaxAbsScaler
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from pipelinehelper import PipelineHelper

iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target
pipe = Pipeline([
    ('scaler', PipelineHelper([
        ('std', StandardScaler()),
        ('max', MaxAbsScaler()),
    ])),
    ('classifier', PipelineHelper([
        ('svm', LinearSVC()),
        ('rf', RandomForestClassifier()),
    ])),
])

params = {
    'scaler__selected_model': pipe.named_steps['scaler'].generate({
        'std__with_mean': [True, False],
        'std__with_std': [True, False],
        'max__copy': [True],  # just for displaying
    }),
    'classifier__selected_model': pipe.named_steps['classifier'].generate({
        'svm__C': [0.1, 1.0],
        'rf__n_estimators': [100, 20],
    })
}
grid = GridSearchCV(pipe, params, scoring='accuracy', verbose=1)
grid.fit(X_iris, y_iris)
print(grid.best_params_)
print(grid.best_score_)

It can also be used for other elements of the pipeline, not just the classifier. The code is on GitHub if anyone wants to check it out.

Edit: I have published this on PyPI if anyone is interested; just install it using pip install pipelinehelper.

bmurauer
10

Instead of using Grid Search for hyperparameter selection, you can use the 'hyperopt' library.

Please have a look at section 2.2 of this page. In the above case, you can use an hp.choice expression to select among the various pipelines and then define the parameter expressions for each one separately.

In your objective function, you need to have a check depending on the pipeline chosen and return the CV score for the selected pipeline and parameters (possibly via cross_val_score).

At the end of the run, the trials object will indicate the best pipeline and parameters overall.
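
A minimal sketch of this idea, using the iris data and two stand-in models purely for illustration (the actual search space and objective would mirror your own pipelines):

import numpy as np
from hyperopt import hp, fmin, tpe, Trials, STATUS_OK
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# hp.choice selects a model type; each branch carries its own hyperparameters
space = hp.choice('classifier', [
    {'model': 'svc',
     'C': hp.loguniform('svc_C', np.log(0.01), np.log(10))},
    {'model': 'rf',
     'n_estimators': hp.choice('rf_n_estimators', [50, 100, 200])},
])

def objective(args):
    # branch on the chosen model and return the negated CV score (hyperopt minimizes)
    if args['model'] == 'svc':
        clf = SVC(C=args['C'])
    else:
        clf = RandomForestClassifier(n_estimators=args['n_estimators'])
    score = cross_val_score(clf, X, y, cv=3).mean()
    return {'loss': -score, 'status': STATUS_OK}

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=30, trials=trials)
print(best)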

Stergios
7

This is how I did it without a wrapper function. You can evaluate any number of classifiers, and each one can have multiple parameters for hyperparameter optimization.

The one with the best score is saved to disk using joblib.

import operator
import joblib
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
# pipeline parameters
parameters = [
    {
        'clf': [MultinomialNB()],
        'tf-idf__stop_words': ['english', None],
        'clf__alpha': [0.001, 0.1, 1, 10, 100]
    },
    {
        'clf': [SVC()],
        'tf-idf__stop_words': ['english', None],
        'clf__C': [0.001, 0.1, 1, 10, 100, 10e5],
        'clf__kernel': ['linear', 'rbf'],
        'clf__class_weight': ['balanced'],
        'clf__probability': [True]
    },
    {
        'clf': [DecisionTreeClassifier()],
        'tf-idf__stop_words': ['english', None],
        'clf__criterion': ['gini', 'entropy'],
        'clf__splitter': ['best', 'random'],
        'clf__class_weight': ['balanced', None]
    }
]

# evaluating multiple classifiers
# based on pipeline parameters
# -------------------------------
result = []

for params in parameters:

    # classifier
    clf = params['clf'][0]

    # getting arguments by
    # popping out classifier
    params.pop('clf')

    # pipeline
    steps = [('tf-idf', TfidfVectorizer()), ('clf', clf)]

    # cross validation using
    # Grid Search
    grid = GridSearchCV(Pipeline(steps), param_grid=params, cv=3)
    grid.fit(features, labels)

    # storing result
    result.append(
        {
            'grid': grid,
            'classifier': grid.best_estimator_,
            'best score': grid.best_score_,
            'best params': grid.best_params_,
            'cv': grid.cv
        }
    )

# sorting result by best score
result = sorted(result, key=operator.itemgetter('best score'), reverse=True)

# saving best classifier
grid = result[0]['grid']
joblib.dump(grid, 'classifier.pickle')

Tarun Pathak
6

Another simple solution to the problem.

First, load all the estimators. Here I will mostly be using classifiers.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
import pandas as pd

logi = LogisticRegression(penalty="elasticnet", l1_ratio=0.5, solver="saga", random_state=4, n_jobs=-1)
rf = RandomForestClassifier(random_state=4, n_jobs=-1, max_features="auto", warm_start=True)
gb = GradientBoostingClassifier(random_state=4, subsample=0.8, max_features="auto", warm_start=True)
svc = SVC(random_state=4, kernel='rbf')
ex = ExtraTreesClassifier(random_state=4, n_jobs=-1, max_features="auto", warm_start=True)

After that, create a list of classifiers:

ensemble_clf=[rf, ex, gb, svc] 

Now, create the parameters for each classifier/estimator:

params1={"max_depth": range(5,30,5), "min_samples_leaf": range(1,30,2),
         "n_estimators":range(100,2000,200)}
params2={"criterion":["gini", "entropy"],"max_depth": range(5,30,5), 
         "min_samples_leaf": range(1,30,2), "n_estimators":range(100,2000,200)}
params3={"learning_rate":[0.001,0.01,0.1], "n_estimators":range(1000,3000,200)}
params4={"kernel":["rbf", "poly"], "gamma": ["auto", "scale"], "degree":range(1,6,1)}

Now create a list of them:

parameters_list=[params1, params2, params3, params4]

Now comes the most important part: create string names for all the models/classifiers/estimators. These are used to name the DataFrames for comparison:

model_log=["_rf", "_ex", "_gb", "_svc"]

Now run a for loop and use grid search:

for i in range(len(ensemble_clf)):
    Grid=GridSearchCV(estimator=ensemble_clf[i], param_grid=parameters_list[i], 
                      n_jobs=-1, cv=3, verbose=3).fit(TrainX_Std, TrainY)
    globals()['Grid%s' % model_log[i]]=pd.DataFrame(Grid.cv_results_)  

The line `globals()['Grid%s' % model_log[i]] = pd.DataFrame(Grid.cv_results_)` creates a separate DataFrame for each estimator; these can then be compared by sorting and picking out the best parameters for each estimator.
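
For instance, a rough sketch of comparing across estimators afterwards (building on the DataFrames created in the loop above) could concatenate the per-model results and sort by mean test score:

# gather the per-estimator cv_results_ DataFrames created above
frames = []
for name in model_log:
    df = globals()['Grid%s' % name].copy()
    df['model'] = name
    frames.append(df)

# rank every candidate from every estimator by its mean CV test score
all_results = pd.concat(frames, ignore_index=True)
best_overall = all_results.sort_values('mean_test_score', ascending=False)
print(best_overall[['model', 'mean_test_score', 'params']].head())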

Hope this helps.

Rupanjan Nayak
  • Liked the approach - pretty neat. However, I would use a dict of models to make it more readable, and instead of creating data frames for every model grid search, **inside the for loop I'd do a `Grid.best_estimator_` to get the best estimator found for a specific model :)** https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#best_estimator_ – onofricamila Apr 16 '20 at 12:43
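
A rough sketch of that suggestion, reusing the estimators, grids, and training data defined in the answer above (the dict pairings here are illustrative):

models = {'rf': rf, 'ex': ex, 'gb': gb, 'svc': svc}
grids = {'rf': params1, 'ex': params2, 'gb': params3, 'svc': params4}

best_estimators = {}
for name, model in models.items():
    grid = GridSearchCV(model, grids[name], cv=3, n_jobs=-1).fit(TrainX_Std, TrainY)
    best_estimators[name] = grid.best_estimator_  # best fitted model per estimator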
3

Another option is to use the HyperclassifierSearch (Github) package. It is close to the solution of bmurauer above.

However, you might

  1. find the DataFrame output for the best model helpful, which skips timing info by default
  2. find the three usage examples helpful
  3. like the shorter core code with around 100 lines

I developed the HyperclassifierSearch package (start with a pip install HyperclassifierSearch) based on the code from David Batista, which I liked for its conciseness.

Detail for point 1, usage of the evaluate_model function:

search = HyperclassifierSearch(models, params)
best_model = search.train_model(X, y)
search.evaluate_model(sort_by='mean_test_score', show_timing_info=False) # default parameters explicitly given
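
The models and params arguments above are not shown here; assuming the package follows the same dict-of-models and dict-of-grids convention as David Batista's EstimatorSelectionHelper (worth checking against the package README), a hypothetical setup might look like:

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

# hypothetical model and parameter-grid definitions for the snippet above
models = {
    'rf': RandomForestClassifier(),
    'svc': LinearSVC(),
}
params = {
    'rf': {'n_estimators': [50, 100]},
    'svc': {'C': [0.1, 1.0]},
}
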
user3070843
  • how do you specify the scoring metric? – Maths12 Aug 06 '21 at 15:20
  • Thanks for sharing this. Re @Maths12, you can pass `scoring` as in sklearn gridsearchcv to the `train_model` method, e.g. `scoring=["f1", "precision"]`. If you pass a string it will work fine, but if you want to pass a list (as in my example) then the code needs a small change in `evaluate_model`. – Marc Torrellas Socastro Sep 05 '21 at 21:18