I have a slightly imbalanced dataset for a binary classification problem, with a positive-to-negative ratio of 0.6. I recently learned about the AUC metric from this answer: https://stats.stackexchange.com/a/132832/128229, and decided to use it.

But I came across another link, http://fastml.com/what-you-wanted-to-know-about-auc/, which claims that AUC-ROC is insensitive to class imbalance and that we should use the AUC of a precision-recall curve instead.

The xgboost docs are not clear on which AUC they use. Do they use AUC-ROC? The link also mentions that AUC should only be used if you do not care about the probabilities and only care about the ranking.

However, since I am using a binary:logistic objective, I think I should care about probabilities, because I have to set a threshold for my predictions.

The xgboost parameter tuning guide https://github.com/dmlc/xgboost/blob/master/doc/how_to/param_tuning.md also suggests an alternative way to handle class imbalance: not balancing the positive and negative samples, and setting max_delta_step = 1.

So can someone explain when AUC is preferred over the other method for handling class imbalance in xgboost? And if I am using AUC, what threshold do I need to set for prediction? More generally, how exactly should I use AUC to handle an imbalanced binary classification problem in xgboost?

EDIT:

I also need to eliminate false positives more than false negatives. How can I achieve that with the binary:logistic objective, apart from simply varying the threshold?

Vikash Balasubramanian

2 Answers

According to the parameters section of the xgboost docs, there are auc and aucpr, where pr stands for precision-recall.

I would say you could build some intuition by running both approaches and seeing how the metrics behave. You can include multiple metrics and even optimize with respect to whichever you prefer.
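
To build that intuition without training anything, here is a minimal pure-Python sketch that computes both quantities on toy scores. The helper names (roc_auc, average_precision) and the toy data are made up for illustration, and average precision is used as a simple stand-in for the area under the precision-recall curve; xgboost's own aucpr may be computed somewhat differently.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of precision at the rank of each positive (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical toy data: two positives and two negatives.
balanced_labels = [1, 1, 0, 0]
balanced_scores = [0.9, 0.6, 0.8, 0.3]

# Same positives, but five times as many negatives around the same score levels.
neg_scores = [0.80, 0.81, 0.82, 0.83, 0.84, 0.30, 0.31, 0.32, 0.33, 0.34]
imbalanced_labels = [1, 1] + [0] * len(neg_scores)
imbalanced_scores = [0.9, 0.6] + neg_scores

print("balanced:  ", roc_auc(balanced_labels, balanced_scores),
      average_precision(balanced_labels, balanced_scores))
print("imbalanced:", roc_auc(imbalanced_labels, imbalanced_scores),
      average_precision(imbalanced_labels, imbalanced_scores))
```

In this toy case the ROC AUC stays at 0.75 when negatives are added, while the average precision drops, which is the kind of difference you would watch for when tracking auc and aucpr side by side.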

You can also monitor the false positive rate in each boosting round by creating a custom metric.
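
A rough sketch of such a metric is below. It follows the (predt, dtrain) -> (name, value) shape that xgboost expects from custom evaluation functions, but it is written in plain Python so it runs on its own. The function name, the 0.5 threshold, and the FakeDMatrix stand-in are assumptions for illustration; it also assumes predt holds probabilities, which depends on the xgboost version and on how the metric is passed in.

```python
def false_positive_rate(predt, dtrain, threshold=0.5):
    """Return ("fpr", FP / (FP + TN)) for predictions cut at `threshold`.

    Assumes `predt` contains probabilities and `dtrain.get_label()`
    returns 0/1 labels in the same order.
    """
    labels = dtrain.get_label()
    fp = tn = 0
    for p, y in zip(predt, labels):
        if y == 0:
            if p > threshold:
                fp += 1  # negative example predicted as positive
            else:
                tn += 1  # negative example predicted as negative
    negatives = fp + tn
    return "fpr", (fp / negatives if negatives else 0.0)


class FakeDMatrix:
    """Hypothetical stand-in for xgboost.DMatrix, used only for this demo."""

    def __init__(self, labels):
        self._labels = labels

    def get_label(self):
        return self._labels


demo_preds = [0.9, 0.7, 0.4, 0.2, 0.6]
demo_data = FakeDMatrix([1, 0, 0, 0, 1])
print(false_positive_rate(demo_preds, demo_data))
```

With a real DMatrix, a function like this would typically be handed to xgb.train through the custom_metric argument (feval in older releases) together with an evals list. Whether predt arrives as probabilities or raw margins should be checked against the docs for your version; if it is raw margins, apply a sigmoid before thresholding.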

Kots

XGBoost chose to write AUC (area under the ROC curve), but some prefer to be more explicit and say AUC-ROC / ROC-AUC.

https://xgboost.readthedocs.io/en/latest/parameter.html

kodkirurg