My goal is to get a well-fitted model (train and test set metric differences of only 1%-5%). This is because Random Forest tends to overfit (with the default params, the train set F1 score for class 1 is 1.0).

The problem is that GridSearchCV only considers the test set metrics when picking the best candidate. It disregards the train set metrics, so the result is still an overfitted model.
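(As I understand it, the default refit=True effectively just picks the candidate with the highest mean validation score and never looks at the train scores, i.e. something like this sketch, using the rf_cv object defined below:)

import numpy as np

# what the default refit effectively does: take the argmax of the
# validation score, ignoring mean_train_score entirely
best_index = np.argmax(rf_cv.cv_results_['mean_test_score'])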

What I've tried:

  1. I tried to access the cv_results_ attribute, but there is a ton of output, I am not sure how to read it, and I believe we are not supposed to parse it manually.
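For what it's worth, the closest I got was dumping it into a DataFrame (a minimal sketch of what I mean; with a single scorer like mine, the relevant columns are mean_train_score and mean_test_score, and the "gap" column is just an illustrative name of mine):

import pandas as pd

# cv_results_ is a dict of parallel arrays, one row per parameter combination
results = pd.DataFrame(rf_cv.cv_results_)

# the train score columns only exist because return_train_score=True
results["gap"] = results["mean_train_score"] - results["mean_test_score"]
print(results[["params", "mean_train_score", "mean_test_score", "gap"]]
      .sort_values("gap"))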

The code

import warnings

from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# model definition
rf_cv = GridSearchCV(estimator=rf_clf_default,
                     # what the user cares about is the model's ability to find class 1
                     scoring=make_scorer(score_func=f1_score, pos_label=1),
                     param_grid={'randomforestclassifier__n_estimators': [37, 38, 39, 100, 200],
                                 'randomforestclassifier__max_depth': [4, 5, 6, 10, 20, 30],
                                 'randomforestclassifier__min_samples_leaf': [2, 3, 4]},
                     return_train_score=True,
                     refit=True)

# ignore OneHotEncoder warning about unknown categories
with warnings.catch_warnings():
    warnings.simplefilter(action="ignore", category=UserWarning)
    # Train the algorithm
    rf_cv.fit(X=X_train, y=y_train)

# get the F1 score for label 1
print("best F1 score class 1", rf_cv.best_score_)

# get the best params
display("best parameters", rf_cv.best_params_)

(image: best params considered by GridSearchCV)

Jason Rich Darmawan

1 Answer

You can provide a callable for the refit parameter:

Where there are considerations other than maximum score in choosing a best estimator, refit can be set to a function which returns the selected best_index_ given cv_results_. In that case, the best_estimator_ and best_params_ will be set according to the returned best_index_ while the best_score_ attribute will not be available.

For example, if you want to consider only hyperparameters whose mean train and test scores are within 0.05 of each other:

import pandas as pd

def my_refit_criteria(cv_results_):
    cv_frame = pd.DataFrame(cv_results_)
    # with a single scorer the columns are 'mean_train_score' and
    # 'mean_test_score'; with a dict of scorers they are suffixed with
    # the scorer's name instead, e.g. 'mean_test_recall'
    gap = cv_frame['mean_train_score'] - cv_frame['mean_test_score']
    candidate_mask = gap < 0.05
    if candidate_mask.sum() > 0:
        candidates = cv_frame[candidate_mask]
    else:
        # if no candidate has a small enough gap, just pick the best overall
        candidates = cv_frame
    # idxmax returns the row's original index, which is the integer
    # best_index_ that refit expects
    return candidates['mean_test_score'].idxmax()

search = GridSearchCV(..., refit=my_refit_criteria)
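Note that this relies on return_train_score=True (which you already pass), so that the mean train scores are present in cv_results_.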

(I haven't tested this; if you see errors let me know.)

There's a more complex example in the docs:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html

Ben Reiniger