
I want to both 1) tune hyperparameters for several different model types and 2) select the best tuned model overall. I would like to use GridSearchCV for this.

I was able to run the code below, but I am concerned that it is not working the way I expect. I am also wondering whether the nested GridSearchCV is even necessary: is it possible to do this using a single GridSearchCV? (I sketch what I mean further down.)

One concern I have with a nested GridSearchCV is that I might be doing nested cross-validation as well: with the default 3-fold CV, each candidate is fit on roughly 66% of the training data, so nesting would mean the inner grid search effectively fits on 66% of 66%, i.e. about 43.56% of the training data. Another concern is that I have increased the code complexity.
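
For reference, here is the single-GridSearchCV variant I was imagining. It is only a sketch: it relies on `param_grid` accepting a list of dicts and on a pipeline step itself being a searchable parameter, and I have not verified that this is the recommended pattern:

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', KernelPCA(n_components=2)),
    ('clf', SVC())   # placeholder; candidate classifiers are supplied via param_grid
    ])
# each dict is one sub-grid; the 'clf' step is itself a grid parameter
param_grid = [
    dict(scaler=[None, StandardScaler()],
         reduce_dim=[None, KernelPCA(n_components=2)],
         clf=[SVC()],
         clf__C=[0.1, 1, 10, 100]),
    dict(scaler=[None, StandardScaler()],
         reduce_dim=[None, KernelPCA(n_components=2)],
         clf=[KNeighborsClassifier()],
         clf__n_neighbors=[2, 5, 15]),
    dict(scaler=[None],       # skip preprocessing for the forest, as in pipe_rf below
         reduce_dim=[None],
         clf=[RandomForestClassifier()],
         clf__n_estimators=[10, 50, 100],
         clf__min_samples_leaf=[2, 5, 10]),
    ]
grid_search_single = GridSearchCV(pipe, param_grid=param_grid)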

Here's my nested GridSearchCV example using the iris dataset:

import numpy as np 
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import KernelPCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris_raw_data = load_iris()
iris_df = pd.DataFrame(np.c_[iris_raw_data.data, iris_raw_data.target], 
                       columns=iris_raw_data.feature_names + ['target'])
iris_category_labels = {0:'setosa', 1:'versicolor', 2:'virginica'}
iris_df['species_name'] = iris_df['target'].apply(lambda l: iris_category_labels[int(l)])

features = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
target = 'target'
X_train, X_test, y_train, y_test = train_test_split(iris_df[features], iris_df[target], test_size=.33)

pipe_knn = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', KernelPCA(n_components=2)),    # project onto 2 kernel-PCA components
    ('clf', KNeighborsClassifier())   
    ]) 
params_knn = dict(scaler=[None, StandardScaler()],
                  reduce_dim=[None, KernelPCA(n_components=2)],
                  clf__n_neighbors=[2, 5, 15]) 
grid_search_knn = GridSearchCV(pipe_knn, param_grid=params_knn)

pipe_svc = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', KernelPCA(n_components=2)),    # project onto 2 kernel-PCA components
    ('clf', SVC())   
    ]) 
params_svc = dict(scaler=[None, StandardScaler()],
                  reduce_dim=[None, KernelPCA(n_components=2)],
                  clf__C=[0.1, 1, 10, 100]) 
grid_search_svc = GridSearchCV(pipe_svc, param_grid=params_svc)

pipe_rf = Pipeline(steps=[
    ('clf', RandomForestClassifier())   
    ]) 
params_rf = dict(clf__n_estimators=[10, 50, 100],
                 clf__min_samples_leaf=[2, 5, 10])

grid_search_rf = GridSearchCV(pipe_rf, param_grid=params_rf)

# 'subpipes' is a placeholder step; each candidate sub-search is swapped in via param_grid
pipe_meta = Pipeline(steps=[('subpipes', pipe_knn)])
params_meta = dict(subpipes=[grid_search_svc, grid_search_knn, grid_search_rf])
grid_search_meta = GridSearchCV(pipe_meta, param_grid=params_meta)

grid_search_meta.fit(X_train, y_train)
print(grid_search_meta.best_estimator_)
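
As a quick sanity check, I also score the selected model on the held-out test set (just a smoke test, not a proper evaluation):

# score the refit winner on the held-out test data
print(grid_search_meta.score(X_test, y_test))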
  • What specifically is concerning you? You already seem to know about `best_estimator_`. – Arya McCarthy May 17 '17 at 21:54
  • You need to explain in detail what you were expecting and what is not working for you. – Vivek Kumar May 18 '17 at 06:05
  • @aryamccarthy Perhaps I got this completely right, but I want to know the correct way to grid search over a pipeline that could consist of completely different sub-steps, each with completely different parameters to tune (e.g. a RandomForestClassifier vs. an SVC). I have attempted this with a nested approach, but I haven't found documentation anywhere suggesting this is a correct design pattern, and while it seems to output reasonable results, I am concerned that it does not work as I expect. – mgoldwasser May 18 '17 at 10:33
  • 1
    This is a fairly common approach to try different algorithms and tune them to see which algorithm is best suited for the data, just not used this way. Traditional approaches like a for loop for all algorithms are used commonly. In your case `grid_search_meta.best_estimator_` will give the best of the three approaches you used, and `grid_search_meta.best_estimator_.best_estimator_` will give the model from the above said best approach, which gave best results on your training data. – Vivek Kumar May 18 '17 at 12:34
  • 1
    @VivekKumar thank you, that is helpful! One quick correction, in my code above, the best estimator is a `Pipeline` containing a single `GridSearchCV` element, so it's actually `grid_search_meta.best_estimator_.steps[0][1].best_estimator_` – mgoldwasser May 18 '17 at 13:36
  • Aah yes, my bad. I overlooked that it's a pipeline. – Vivek Kumar May 22 '17 at 09:47

0 Answers