
This is a tutorial I'm following on machine learning with scikit-learn. I was using three different classifiers: a decision tree, logistic regression, and K-nearest neighbors. The individual classifiers worked fine, and I combined them into an ensemble learning algorithm using majority voting, represented as mv_clf in the code.
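
For context, this is roughly how mv_clf is put together (a sketch reconstructed from the printed estimator further down; the data loading and train/test split are omitted):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier

# Logistic regression and KNN are scaled inside pipelines; the tree is not.
pipe_lr = Pipeline([('sc', StandardScaler()),
                    ('clf', LogisticRegression(C=0.001, random_state=1))])
dt = DecisionTreeClassifier(criterion='entropy', max_depth=1, random_state=0)
pipe_knn = Pipeline([('sc', StandardScaler()),
                     ('clf', KNeighborsClassifier(n_neighbors=1))])

# Soft voting averages the predicted class probabilities of the three estimators.
mv_clf = VotingClassifier(estimators=[('lr', pipe_lr),
                                      ('dt', dt),
                                      ('KNN', pipe_knn)],
                          voting='soft')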

These are the results of the classifiers:

10-fold cross-validation:

ROC AUC: 0.92 (+/-  0.15) [Logistic Regression]
ROC AUC: 0.87 (+/-  0.18) [Decision tree]
ROC AUC: 0.85 (+/-  0.13) [KNN]
Accuracy: 0.92 (+/-  0.15) [Logistic Regression]
Accuracy: 0.87 (+/-  0.18) [Decision tree]
Accuracy: 0.85 (+/-  0.13) [KNN]
Accuracy: 0.98 (+/-  0.05) [Majority voting]
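
For reference, the scores were computed roughly like this (a sketch; using cross_val_score here is an assumption, since the exact evaluation code is not shown above):

from sklearn.model_selection import cross_val_score

clf_labels = ['Logistic Regression', 'Decision tree', 'KNN', 'Majority voting']
for clf, label in zip([pipe_lr, dt, pipe_knn, mv_clf], clf_labels):
    # 10-fold cross-validation on the training set, scored by ROC AUC;
    # the Accuracy lines are produced the same way with scoring='accuracy'
    scores = cross_val_score(estimator=clf, X=X_train, y=y_train,
                             cv=10, scoring='roc_auc')
    print("ROC AUC: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))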

However, when I tried GridSearchCV to tune the parameters as in the tutorial, grid.fit() raised an error. I searched the GridSearchCV documentation, but I failed to understand why the fit fails, because the output of the GridSearchCV itself seems fine.

params = {'pipeline-1__clf__C': [0.001, 0.1, 100.0],
          'decisiontreeclassifier__max_depth': [1, 2],
          'pipeline-2__n_neighbors': [1, 2]}
grid = GridSearchCV(estimator=mv_clf, param_grid=params,
                    scoring='roc_auc', cv=10)
print(grid)
grid.fit(X_train, y_train)

Output of print(grid):

GridSearchCV(cv=10,
             estimator=VotingClassifier(estimators=[('lr',
                                                     Pipeline(steps=[['sc',
                                                                      StandardScaler()],
                                                                     ['clf',
                                                                      LogisticRegression(C=0.001,
                                                                                         random_state=1)]])),
                                                    ('dt',
                                                     DecisionTreeClassifier(criterion='entropy',
                                                                            max_depth=1,
                                                                            random_state=0)),
                                                    ('KNN',
                                                     Pipeline(steps=[['sc',
                                                                      StandardScaler()],
                                                                     ['clf',
                                                                      KNeighborsClassifier(n_neighbors=1)]]))],
                                        voting='soft'),
             param_grid={'decisiontreeclassifier__max_depth': [1, 2],
                         'pipeline-1__clf__C': [0.001, 0.1, 100.0],
                         'pipeline-2__n_neighbors': [1, 2]},
             scoring='roc_auc')

So print(grid) produces normal output, but when I call grid.fit(), there is an error and I am not sure why.

This is the error shown after grid.fit() is called:

Traceback (most recent call last):
  File "/Users/cheokjiaheng/Documents/Coding Projects/Tutorials/Python Machine Learning Book/Combining Diff Models/MajorityVoting.py", line 115, in <module>
    grid.fit(X_train, y_train)
  ...
  ...
  ...
  File "/Users/cheokjiaheng/miniforge3/envs/tensorflowenv/lib/python3.8/site-packages/sklearn/base.py", line 230, in set_params
    raise ValueError('Invalid parameter %s for estimator %s. '
ValueError: Invalid parameter decisiontreeclassifier for estimator VotingClassifier(estimators=[('lr',
                              Pipeline(steps=[['sc', StandardScaler()],
                                              ['clf',
                                               LogisticRegression(C=0.001,
                                                                  random_state=1)]])),
                             ('dt',
                              DecisionTreeClassifier(criterion='entropy',
                                                     max_depth=1,
                                                     random_state=0)),
                             ('KNN',
                              Pipeline(steps=[['sc', StandardScaler()],
                                              ['clf',
                                               KNeighborsClassifier(n_neighbors=1)]]))],
                 voting='soft'). Check the list of available parameters with `estimator.get_params().keys()`.
  • Why, while you are (correctly) using `clf` to refer to your `LogisticRegression`, you then decide that, instead of similarly using `dt` to refer to your `DecisionTreeClassifier`, you will use `decisiontreeclassifier`? Is `decisiontreeclassifier` (case sensitive) defined anywhere in your code? Similarly for your `KNN` further down... – desertnaut May 25 '21 at 20:44

1 Answer


This question relates to some extent to the problem stated here, and the error is due to the specification you make in the parameter grid. When tuning the hyperparameters of a VotingClassifier, you have to specify the key of the estimator in the VotingClassifier, followed by __ (two underscores), and then the attribute itself.

But since two of your estimators are pipelines, you further have to specify the step, again separated by __ (two underscores). This means your parameter grid should look like this:

param_grid={
    'dt__max_depth': [1, 2],
    'lr__clf__C': [0.001, 0.1, 100.0],
    'KNN__clf__n_neighbors': [1, 2]
}

Again, for the DecisionTreeClassifier you only specify the name you chose when instantiating the VotingClassifier plus the attribute name, while for the pipelines you need to specify the name, then the step, and then the attribute.
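
As the error message itself suggests, you can list the valid parameter names before fitting. A quick check along these lines (the printed keys are abbreviated for illustration):

print(mv_clf.get_params().keys())
# dict_keys([..., 'lr__clf__C', ..., 'dt__max_depth', ...,
#            'KNN__clf__n_neighbors', ...])

params = {'dt__max_depth': [1, 2],
          'lr__clf__C': [0.001, 0.1, 100.0],
          'KNN__clf__n_neighbors': [1, 2]}
grid = GridSearchCV(estimator=mv_clf, param_grid=params,
                    scoring='roc_auc', cv=10)
grid.fit(X_train, y_train)  # should now fit without the ValueError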

afsharov
  • If this is a duplicate, please flag it as such instead of answering it. – desertnaut May 25 '21 at 20:45
  • @desertnaut the answer I linked to does not address how to handle the pipeline objects. This is why I think it's not really a "true" duplicate and it is worth leaving an answer that resolves the OP's question completely and not just partially. Otherwise, I would agree with you and just flag it instead of answering. – afsharov May 25 '21 at 20:58
  • Thanks so much, it works now. I appreciate the help and explanation. – Cheok Jia Heng May 26 '21 at 08:48