
I want to build a Pipeline in sklearn and test different models using GridSearchCV.

Just an example (please do not pay attention to which particular models are chosen):

from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, TSNE
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

reg = LogisticRegression()

proj1 = PCA(n_components=2)
proj2 = MDS()
proj3 = TSNE()

pipe = Pipeline([('proj', proj1), ('reg', reg)])

param_grid = {
    'reg__C': [0.01, 0.1, 1],
}

clf = GridSearchCV(pipe, param_grid=param_grid)

Here if I want to try different models for dimensionality reduction, I need to code different pipelines and compare them manually. Is there an easy way to do it?

One solution I came up with is to define my own class derived from BaseEstimator:

class Projection(BaseEstimator):
    def __init__(self, est_name):
        self.est_name = est_name  # store the constructor arg so get_params/set_params work
        if est_name == "MDS":
            self.model = MDS()
        ...
    ...
    def fit_transform(self, X, y=None):
        return self.model.fit_transform(X)

I think this will work: I just create a Projection object and pass it to the Pipeline, using the names of the estimators as parameters for it.
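For concreteness, here is a hypothetical sketch of how such a wrapper could be driven from the grid (the string names are just whatever the constructor dispatches on; note that for set_params to switch models cleanly, the model construction would really belong in fit rather than __init__):

pipe = Pipeline([('proj', Projection(est_name="PCA")),
                 ('reg', LogisticRegression())])

param_grid = {
    'proj__est_name': ['PCA', 'MDS', 'TSNE'],  # hypothetical names the wrapper dispatches on
    'reg__C': [0.01, 0.1, 1],
}

clf = GridSearchCV(pipe, param_grid=param_grid)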

But to me this way is a bit chaotic and not scalable: it forces me to define a new class each time I want to compare different models. Continuing down this path, one could implement a class that does the same job for an arbitrary set of models, but that seems overcomplicated to me.

What is the most natural and pythonic way to compare different models?


2 Answers


Let's assume you want to use PCA and TruncatedSVD as your dimensionality reduction step.

from sklearn import decomposition
from sklearn.svm import SVC

pca = decomposition.PCA()
svd = decomposition.TruncatedSVD()
svm = SVC()
n_components = [20, 40, 64]

You can do this:

pipe = Pipeline(steps=[('reduction', pca), ('svm', svm)])

# Change params_grid: instead of a dict, make it a list of dicts.
# The first dict holds the PCA-related parameters, the second the SVD-related ones.
# Note: single values such as the estimator itself must be wrapped in a
# one-element list, otherwise recent scikit-learn versions raise an error.

params_grid = [{
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'reduction': [pca],
    'reduction__n_components': n_components,
},
{
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'reduction': [svd],
    'reduction__n_components': n_components,
    'reduction__algorithm': ['randomized'],
}]

and now just pass the pipeline object to GridSearchCV:

grd = GridSearchCV(pipe, param_grid=params_grid)

Calling grd.fit() will search over both elements of the params_grid list, using all parameter combinations from one dict at a time.
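For example, assuming some training data X and y are already defined:

grd.fit(X, y)
print(grd.best_params_)    # includes which 'reduction' estimator won
print(grd.best_score_)     # its cross-validated score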

Please look at my other answer for more details: "Parallel" pipeline to get best model using gridsearch

  • Did I understand correctly that one also needs to include 'reduction': algo_name in each of the two elements of param_grid? Otherwise svd is not used in training the classifiers (if I understood your other answer, which worked for me, correctly). – sooobus May 10 '18 at 06:00
  • @sooobus Aah yes, it was a mistake on my part. Corrected now. Thanks. – Vivek Kumar May 10 '18 at 06:16
  • @VivekKumar Excellent answer! FYI, in scikit-learn 0.23 (and possibly in earlier versions), single values in a parameter grid need to be wrapped in a list with one element, otherwise it will error. Thus, you have to use 'reduction':[pca] and 'reduction':[svd] in order for it to work. – Kevin Markham Jul 29 '20 at 14:49

An alternative solution that does not require prefixing the estimator names in the parameter grid is the following:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# the models that you want to compare
models = {
    'RandomForestClassifier': RandomForestClassifier(),
    'KNeighboursClassifier': KNeighborsClassifier(),
    'LogisticRegression': LogisticRegression()
}

# the optimisation parameters for each of the above models
params = {
    'RandomForestClassifier': {
        "n_estimators": [100, 200, 500, 1000],
        "max_features": ["auto", "sqrt", "log2"],
        "bootstrap": [True],
        "criterion": ['gini', 'entropy'],
        "oob_score": [True, False]
    },
    'KNeighboursClassifier': {
        'n_neighbors': np.arange(3, 15),
        'weights': ['uniform', 'distance'],
        'algorithm': ['ball_tree', 'kd_tree', 'brute']
    },
    'LogisticRegression': {
        'solver': ['newton-cg', 'sag', 'lbfgs'],
        'multi_class': ['ovr', 'multinomial']
    }
}

and you can define:

from sklearn.model_selection import GridSearchCV

def fit(train_features, train_actuals):
    """
    Fits each of the models to the training data, obtaining in each
    case an evaluation score via GridSearchCV cross-validation.
    """
    for name in models.keys():
        est = models[name]
        est_params = params[name]
        gscv = GridSearchCV(estimator=est, param_grid=est_params, cv=5)
        gscv.fit(train_features, train_actuals)
        print("best parameters for {} are: {}".format(name, gscv.best_params_))

This basically runs through the different models, each referring to its own set of optimisation parameters through a dictionary. Of course, do not forget to pass the models and params dictionaries to the fit function if you do not have them as global variables. Have a look at this GitHub project for a more complete overview.
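For instance, a variant of this idea (my sketch, not taken from the linked project) that receives the dictionaries explicitly and returns the overall best estimator could look like:

def fit_all(models, params, train_features, train_actuals):
    """Runs GridSearchCV per model and returns the best estimator overall."""
    best_score, best_est = -float('inf'), None
    for name, est in models.items():
        gscv = GridSearchCV(estimator=est, param_grid=params[name], cv=5)
        gscv.fit(train_features, train_actuals)
        print("{}: best score {:.3f} with {}".format(
            name, gscv.best_score_, gscv.best_params_))
        if gscv.best_score_ > best_score:
            best_score, best_est = gscv.best_score_, gscv.best_estimator_
    return best_est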

  • gscv.fit(train_actuals, train_features) <-- wrong way round I think – Maths12 Feb 05 '20 at 20:20
  • Once I loop through all my models to find the best hyperparameters using grid search, do I use the model which gave the highest "best score" value? E.g. if after doing the above I found that random forest's best score = 0.8 and logistic regression's best score = 0.9, would I take logistic regression? – Maths12 Feb 08 '20 at 19:08
  • In principle yes, you would take that one model and train it on the entire data set. Notice, however, that cross-validation scores are only an indication of the error one can encounter on unknown test data. – gented Feb 10 '20 at 08:18
  • I think it is not a complete answer since it does not show how to nest the other steps of the pipeline. I have tested it a little bit, but could not figure it out with your syntax. – OuttaSpaceTime Nov 26 '21 at 19:49