To do so, you'll have to dig into the detailed results of the whole grid search CV procedure; fortunately, these detailed results are returned in the cv_results_ attribute of the GridSearchCV object (docs).
I have rerun your code as-is, but I am not retyping it here; suffice it to say that, despite explicitly setting the random number generator's seed, I am getting a different final result (presumably due to different library versions), namely:
{'min_samples_split': 322}
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=None, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=322,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=42, splitter='best')
but this is not important for the issue at hand here.
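For context, here is a minimal sketch of the kind of multi-metric grid search assumed throughout the rest of this answer; the dataset, the CV setting, and the exact parameter grid below are placeholders of mine, not your actual code (the scorer names are what produce the *_PRECISION and *_F1 columns shown further down):

from sklearn.datasets import load_breast_cancer   # placeholder dataset, not yours
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# two scoring metrics; the dict keys become the suffixes in cv_results_
scoring = {'PRECISION': 'precision', 'F1': 'f1'}

gs = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={'min_samples_split': range(2, 403, 10)},  # guessed from the results shown below
    scoring=scoring,
    refit='F1',   # the metric used to select the single refitted best_estimator_
    cv=5,
)
gs.fit(X, y)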
The easiest way to work with the returned cv_results_ dictionary is to convert it to a pandas DataFrame:
import pandas as pd
cv_results = pd.DataFrame.from_dict(gs.cv_results_)
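To get an idea of everything that is in there, you can simply list the dataframe's columns; with multi-metric scoring, the per-split, mean, std, and rank entries are suffixed with your scorer names:

print(list(cv_results.columns))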
Still, since it includes a lot of information (columns), I will simplify it further here to demonstrate the issue (feel free to explore it more fully yourself):
df = cv_results[['params', 'mean_test_PRECISION', 'rank_test_PRECISION', 'mean_test_F1', 'rank_test_F1']]
pd.set_option("display.max_rows", None, "display.max_columns", None)
pd.set_option('expand_frame_repr', False)
print(df)
Result:
params mean_test_PRECISION rank_test_PRECISION mean_test_F1 rank_test_F1
0 {'min_samples_split': 2} 0.771782 1 0.763041 41
1 {'min_samples_split': 12} 0.768040 2 0.767331 38
2 {'min_samples_split': 22} 0.767196 3 0.776677 29
3 {'min_samples_split': 32} 0.760282 4 0.773634 32
4 {'min_samples_split': 42} 0.754572 8 0.777967 26
5 {'min_samples_split': 52} 0.754034 9 0.777550 27
6 {'min_samples_split': 62} 0.758131 5 0.773348 33
7 {'min_samples_split': 72} 0.756021 6 0.774301 30
8 {'min_samples_split': 82} 0.755612 7 0.768065 37
9 {'min_samples_split': 92} 0.750527 10 0.771023 34
10 {'min_samples_split': 102} 0.741016 11 0.769896 35
11 {'min_samples_split': 112} 0.740965 12 0.765353 39
12 {'min_samples_split': 122} 0.731790 13 0.763620 40
13 {'min_samples_split': 132} 0.723085 14 0.768605 36
14 {'min_samples_split': 142} 0.713345 15 0.774117 31
15 {'min_samples_split': 152} 0.712958 16 0.776721 28
16 {'min_samples_split': 162} 0.709804 17 0.778287 24
17 {'min_samples_split': 172} 0.707080 18 0.778528 22
18 {'min_samples_split': 182} 0.702621 19 0.778516 23
19 {'min_samples_split': 192} 0.697630 20 0.778103 25
20 {'min_samples_split': 202} 0.693011 21 0.781047 10
21 {'min_samples_split': 212} 0.693011 21 0.781047 10
22 {'min_samples_split': 222} 0.693011 21 0.781047 10
23 {'min_samples_split': 232} 0.692810 24 0.779705 13
24 {'min_samples_split': 242} 0.692810 24 0.779705 13
25 {'min_samples_split': 252} 0.692810 24 0.779705 13
26 {'min_samples_split': 262} 0.692810 24 0.779705 13
27 {'min_samples_split': 272} 0.692810 24 0.779705 13
28 {'min_samples_split': 282} 0.692810 24 0.779705 13
29 {'min_samples_split': 292} 0.692810 24 0.779705 13
30 {'min_samples_split': 302} 0.692810 24 0.779705 13
31 {'min_samples_split': 312} 0.692810 24 0.779705 13
32 {'min_samples_split': 322} 0.688417 33 0.782772 1
33 {'min_samples_split': 332} 0.688417 33 0.782772 1
34 {'min_samples_split': 342} 0.688417 33 0.782772 1
35 {'min_samples_split': 352} 0.688417 33 0.782772 1
36 {'min_samples_split': 362} 0.688417 33 0.782772 1
37 {'min_samples_split': 372} 0.688417 33 0.782772 1
38 {'min_samples_split': 382} 0.688417 33 0.782772 1
39 {'min_samples_split': 392} 0.688417 33 0.782772 1
40 {'min_samples_split': 402} 0.688417 33 0.782772 1
The names of the columns should be self-explanatory: they include the parameters tried, the scores for each of the metrics used, and the corresponding ranks (1 meaning the best). You can immediately see, for example, that although 'min_samples_split': 322 does indeed give the best F1 score, it is not the only parameter setting that does so; several other settings also achieve the best F1 score and therefore share a rank_test_F1 of 1 in the results.
From this point, it is trivial to get the info you want; for example, here are the best models for each of your two metrics:
print(df.loc[df['rank_test_PRECISION']==1]) # best precision
# result:
params mean_test_PRECISION rank_test_PRECISION mean_test_F1 rank_test_F1
0 {'min_samples_split': 2} 0.771782 1 0.763041 41
print(df.loc[df['rank_test_F1']==1]) # best F1
# result:
params mean_test_PRECISION rank_test_PRECISION mean_test_F1 rank_test_F1
32 {'min_samples_split': 322} 0.688417 33 0.782772 1
33 {'min_samples_split': 332} 0.688417 33 0.782772 1
34 {'min_samples_split': 342} 0.688417 33 0.782772 1
35 {'min_samples_split': 352} 0.688417 33 0.782772 1
36 {'min_samples_split': 362} 0.688417 33 0.782772 1
37 {'min_samples_split': 372} 0.688417 33 0.782772 1
38 {'min_samples_split': 382} 0.688417 33 0.782772 1
39 {'min_samples_split': 392} 0.688417 33 0.782772 1
40 {'min_samples_split': 402} 0.688417 33 0.782772 1
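Finally, should you decide after this inspection that you actually prefer, say, the precision-optimal setting over the one GridSearchCV refitted for you, nothing stops you from fitting such a model yourself. A minimal sketch of the idea (X and y stand for your own training data, which I don't have):

from sklearn.base import clone

# take the parameter dict of the row ranked 1st for precision
best_precision_params = df.loc[df['rank_test_PRECISION'] == 1, 'params'].iloc[0]

# build a fresh, unfitted copy of the grid search's base estimator with these parameters
best_precision_model = clone(gs.estimator).set_params(**best_precision_params)
best_precision_model.fit(X, y)  # X, y: your training data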