
I'm evaluating an XGBoost classifier. I split the dataset into train and validation sets, run a cross-validation on the train set with the default model configuration, and compute the ROC AUC:

from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

xgbClassCV = XGBClassifier()
kfold = StratifiedKFold(n_splits = 5)
auc = cross_val_score(xgbClassCV, x_train, y_train, scoring = "roc_auc", cv = kfold)
auc_avg = auc.mean()

The ROC AUC (auc_avg) is roughly 0.76.

I then tune the hyperparameters with a randomized search, again cross-validating on the train set:

from sklearn.model_selection import RandomizedSearchCV

xgbGrid = {..., ..., ..., ...}
xgbClassHT = XGBClassifier()
kfold = StratifiedKFold(n_splits = 5)
xgbClassRand = RandomizedSearchCV(estimator = xgbClassHT, param_distributions = xgbGrid, n_iter = 60,
    cv = kfold, n_jobs = -1, verbose = 2)
xgbClassRand.fit(x_train, y_train)

I retrieve the best parameters, train an XGBoost classifier with those parameters, make predictions on the validation set, and compute the ROC AUC:

from sklearn import metrics

xgbClassFT = XGBClassifier(..., ..., ..., ...)
xgbClassFT.fit(x_train, y_train)
predictions = xgbClassFT.predict(x_val)
auc = metrics.roc_auc_score(y_val, predictions)
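
As a sketch of the retrieval step described above (the actual tuned values are elided in the post), the best parameters can be read from the fitted search object and unpacked directly:

best_params = xgbClassRand.best_params_   # dict of the winning hyperparameter values
xgbClassFT = XGBClassifier(**best_params)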

This ROC AUC (auc) is roughly 0.65, about 11 points lower than the one above. I find this puzzling because it doesn't happen with the other scores I compute (accuracy, precision, recall and F1): those stay roughly the same.

I repeat the whole process with a logistic regression and the same thing happens: an approximately 11-point gap between the two ROC AUCs, while the other scores stay roughly the same.

Any help to understand what could be going on here would be really appreciated. Thanks!

  • The default metric when running the grid search is `accuracy`. If I run the grid search with `roc_auc` as the scoring metric, the search delivers a best `roc_auc` of around **0.76**, so the grid search gives results similar to the regular cross-validation. It's only when I use the fine-tuned model to make predictions that I see this puzzling drop in `roc_auc`. – Jaime Andrés Castañeda Mar 03 '23 at 00:48
  • In the regular cross-validation, if I instead make predictions with `cross_val_predict()` and then compute the `roc_auc`, the score is roughly **0.65**. It seems the difference I'm reporting could come down to whether the score is computed during cross-validation or from predictions (see the sketch below). However, this is still rather puzzling since the other scores stay roughly the same. – Jaime Andrés Castañeda Mar 03 '23 at 16:05
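
As a hedged sketch of the check from the second comment (reusing the `x_train` and `y_train` names from the question): `cross_val_predict()` defaults to `method = "predict"`, which yields hard 0/1 labels, whereas asking for `method = "predict_proba"` returns out-of-fold probability estimates that `roc_auc_score` can rank properly.

from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import roc_auc_score

model = XGBClassifier()
kfold = StratifiedKFold(n_splits = 5)

# Out-of-fold probabilities of the positive class instead of hard 0/1 labels
proba = cross_val_predict(model, x_train, y_train, cv = kfold, method = "predict_proba")[:, 1]
auc_from_proba = roc_auc_score(y_train, proba)

Computed from probabilities, the out-of-fold AUC should sit close to the ~0.76 reported by `cross_val_score`; the hard-label version is what lands near 0.65.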

1 Answer


For classifiers that follow the scikit-learn API, predict returns integer-encoded predicted classes, not probabilities (in the binary case it effectively rounds the predicted probability to 0 or 1), so you should not use its output to compute ROC AUC scores.

If your data is binary, you can use predict_proba(x_val)[:, 1] to get the estimated probability of the positive class, which is what roc_auc_score should be given.
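
A minimal sketch of that fix using the names from the question (xgbClassFT, x_val and y_val are assumed to be the fitted model and validation split from the post):

from sklearn import metrics

# Estimated probability of the positive class for each validation sample
proba_val = xgbClassFT.predict_proba(x_val)[:, 1]

# ROC AUC computed from the ranked probabilities rather than hard 0/1 labels
auc = metrics.roc_auc_score(y_val, proba_val)

Feeding predict(x_val) into roc_auc_score instead collapses the ranking to just two values, which is why the AUC computed that way comes out lower.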

eschibli