My goal is to get a well-fitted model (train and test set metric differences of only 1%-5%). This is because Random Forest tends to overfit (with the default params, the train set F1 score for class 1 is 1.0).

The problem is that GridSearchCV only considers the test set metrics when picking the best candidate. It disregards the train set metrics, so the result is still an overfitted model.
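(As I understand it, the default refit=True effectively just picks the candidate with the highest mean validation score and never looks at the train scores, i.e. something like this sketch, using the rf_cv object defined below:)

import numpy as np

# what the default refit effectively does: take the argmax of the
# validation score, ignoring mean_train_score entirely
best_index = np.argmax(rf_cv.cv_results_['mean_test_score'])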

What I've tried:

  1. I tried to access the cv_results_ attribute, but there is a ton of output, I am not sure how to read it, and I believe we are not supposed to parse it manually.
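For what it's worth, the closest I got was dumping it into a DataFrame (a minimal sketch of what I mean; with a single scorer like mine, the relevant columns are mean_train_score and mean_test_score, and the "gap" column is just an illustrative name of mine):

import pandas as pd

# cv_results_ is a dict of parallel arrays, one row per parameter combination
results = pd.DataFrame(rf_cv.cv_results_)

# the train score columns only exist because return_train_score=True
results["gap"] = results["mean_train_score"] - results["mean_test_score"]
print(results[["params", "mean_train_score", "mean_test_score", "gap"]]
      .sort_values("gap"))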

The code

import warnings

from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# model definition
rf_cv = GridSearchCV(estimator=rf_clf_default,
                     # what the user cares about is the model's ability to find class 1
                     scoring=make_scorer(score_func=f1_score, pos_label=1),
                     param_grid={'randomforestclassifier__n_estimators': [37, 38, 39, 100, 200],
                                 'randomforestclassifier__max_depth': [4, 5, 6, 10, 20, 30],
                                 'randomforestclassifier__min_samples_leaf': [2, 3, 4]},
                     return_train_score=True,
                     refit=True)

# ignore OneHotEncoder warning about unknown categories
with warnings.catch_warnings():
    warnings.simplefilter(action="ignore", category=UserWarning)
    # Train the algorithm
    rf_cv.fit(X=X_train, y=y_train)

# get the F1 score for label 1
print("best F1 score class 1", rf_cv.best_score_)

# get the best params
display("best parameters", rf_cv.best_params_)

(image: best params considered by GridSearchCV)

Jason Rich Darmawan

1 Answer

You can provide a callable for the refit parameter:

Where there are considerations other than maximum score in choosing a best estimator, refit can be set to a function which returns the selected best_index_ given cv_results_. In that case, the best_estimator_ and best_params_ will be set according to the returned best_index_ while the best_score_ attribute will not be available.

For example, if you want to consider only hyperparameters whose mean train and test scores are within 0.05 of each other:

import pandas as pd

def my_refit_criteria(cv_results_):
    cv_frame = pd.DataFrame(cv_results_)
    # with a single scorer the columns are 'mean_train_score' and
    # 'mean_test_score'; with a dict of scorers they are suffixed with
    # the scorer's name instead, e.g. 'mean_test_recall'
    gap = cv_frame['mean_train_score'] - cv_frame['mean_test_score']
    candidate_mask = gap < 0.05
    if candidate_mask.sum() > 0:
        candidates = cv_frame[candidate_mask]
    else:
        # if no candidate has a small enough gap, just pick the best overall
        candidates = cv_frame
    # idxmax returns the row's original index, which is the integer
    # best_index_ that refit expects
    return candidates['mean_test_score'].idxmax()

search = GridSearchCV(..., refit=my_refit_criteria)
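Note that this relies on return_train_score=True (which you already pass), so that the mean train scores are present in cv_results_.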

(I haven't tested this; if you see errors let me know.)

There's a more complex example in the docs:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html

Ben Reiniger