My goal is to get a well-fitted model (train and test set metrics differ by only 1%-5%), because Random Forest tends to overfit: with the default params, the train set F1 score for class 1 is 1.0.
The problem is that GridSearchCV only considers the test (validation) fold metrics when selecting the best parameters; it disregards the train fold metrics entirely. Therefore, the result is still an overfitted model.
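To illustrate (a minimal sketch, assuming NumPy and the fitted rf_cv search from the code below): the selected candidate is just the argmax of the mean validation score, and the mean_train_score that return_train_score=True records is never consulted during selection.

import numpy as np

# GridSearchCV picks the candidate with the highest mean validation score...
best = int(np.argmax(rf_cv.cv_results_['mean_test_score']))
assert best == rf_cv.best_index_
# ...while the train score of that same candidate is recorded but ignored;
# a value near 1.0 here, well above mean_test_score, is exactly the overfit I see
print(rf_cv.cv_results_['mean_train_score'][best])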
What I've done:
- I tried to access the `cv_results_` attribute, but it produces tons of output, I am not sure how to read it, and I believe we are not supposed to parse it manually. See the sketch after this list for how far I got.
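For completeness, this is roughly what I tried (a sketch, assuming pandas): `cv_results_` has one entry per hyperparameter candidate, so it can be loaded into a DataFrame and the train/validation gap computed per candidate. But filtering this table by hand feels like reimplementing what the search itself should do.

import pandas as pd

# one row per hyperparameter candidate
results = pd.DataFrame(rf_cv.cv_results_)
# gap between train and validation F1; my target is a gap of 0.01-0.05
results['gap'] = results['mean_train_score'] - results['mean_test_score']
print(results.sort_values('gap')[['params', 'mean_train_score',
                                  'mean_test_score', 'gap']].head(10))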
The code:
import warnings

from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# model definition
rf_cv = GridSearchCV(
    estimator=rf_clf_default,
    # what the user cares about is the model's ability to find class 1
    scoring=make_scorer(score_func=f1_score, pos_label=1),
    param_grid={'randomforestclassifier__n_estimators': [37, 38, 39, 100, 200],
                'randomforestclassifier__max_depth': [4, 5, 6, 10, 20, 30],
                'randomforestclassifier__min_samples_leaf': [2, 3, 4]},
    return_train_score=True,
    refit=True,
)

# ignore the OneHotEncoder warning about unknown categories
with warnings.catch_warnings():
    warnings.simplefilter(action="ignore", category=UserWarning)
    # train the algorithm
    rf_cv.fit(X=X_train, y=y_train)

# get the F1 score for class 1 (mean validation score of the best candidate)
print("best F1 score class 1", rf_cv.best_score_)
# get the best params
display("best parameters", rf_cv.best_params_)