My training data is extremely class-imbalanced, {0: 872525, 1: 3335}, with 100 features. I use XGBoost to build a classification model, with Bayesian optimisation to tune the hyperparameters over the following ranges:
{ learning_rate: (0.001, 0.1),
  min_split_loss: (0, 10),
  max_depth: (3, 70),
  min_child_weight: (1, 20),
  max_delta_step: (1, 20),
  subsample: (0, 1),
  colsample_bytree: (0.5, 1),
  lambda: (0, 10),
  alpha: (0, 10),
  scale_pos_weight: (1, 262),
  n_estimators: (1, 20) }
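Here is a minimal sketch of how such a search can be wired up, assuming the bayes_opt package and xgboost's native cv API; X_train and y_train are placeholders for my actual data, and reg_lambda / reg_alpha are just Python-safe names for lambda and alpha:

```python
# A minimal sketch, assuming the bayes_opt package and xgboost's native
# cv API; X_train / y_train are placeholders for the actual training data.
import xgboost as xgb
from bayes_opt import BayesianOptimization

dtrain = xgb.DMatrix(X_train, label=y_train)

def xgb_cv_auc(learning_rate, min_split_loss, max_depth, min_child_weight,
               max_delta_step, subsample, colsample_bytree,
               reg_lambda, reg_alpha, scale_pos_weight, n_estimators):
    params = {
        'objective': 'binary:logistic',
        'booster': 'gbtree',
        'eval_metric': 'auc',
        'learning_rate': learning_rate,
        'min_split_loss': min_split_loss,
        'max_depth': int(max_depth),            # depth must be an integer
        'min_child_weight': min_child_weight,
        'max_delta_step': int(max_delta_step),
        'subsample': subsample,
        'colsample_bytree': colsample_bytree,
        'lambda': reg_lambda,                   # 'lambda' is reserved in Python
        'alpha': reg_alpha,
        'scale_pos_weight': scale_pos_weight,
    }
    cv = xgb.cv(params, dtrain, num_boost_round=int(n_estimators),
                nfold=5, stratified=True, seed=42)
    return cv['test-auc-mean'].iloc[-1]         # maximise mean CV ROC AUC

pbounds = {
    'learning_rate': (0.001, 0.1), 'min_split_loss': (0, 10),
    'max_depth': (3, 70), 'min_child_weight': (1, 20),
    'max_delta_step': (1, 20), 'subsample': (0, 1),
    'colsample_bytree': (0.5, 1), 'reg_lambda': (0, 10),
    'reg_alpha': (0, 10), 'scale_pos_weight': (1, 262),
    'n_estimators': (1, 20),
}

optimizer = BayesianOptimization(f=xgb_cv_auc, pbounds=pbounds, random_state=42)
optimizer.maximize(init_points=10, n_iter=50)
print(optimizer.max)
```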
I also use binary:logistic as the objective, roc_auc as the evaluation metric, and gbtree as the booster.
The cross-validation score is 82.5%.
However, when I applied the model to the test data I got only:

roc_auc: 75.2%,
pr_auc: 15%,
log_loss: 0.046,

and the confusion matrix:

[[19300     7]
 [  103    14]]
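These scores can be reproduced along the following lines; this is a minimal sketch assuming scikit-learn metrics, a booster trained with the native API, and X_test / y_test placeholders, with average_precision_score standing in for the PR AUC:

```python
# A minimal sketch of the test-set evaluation, assuming scikit-learn metrics
# and a booster trained with xgboost's native API; X_test / y_test are
# placeholders for the actual test data.
import xgboost as xgb
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             log_loss, confusion_matrix)

dtest = xgb.DMatrix(X_test)
proba = booster.predict(dtest)          # binary:logistic outputs P(y = 1)
pred = (proba >= 0.5).astype(int)       # default 0.5 decision threshold

print('roc_auc :', roc_auc_score(y_test, proba))
print('pr_auc  :', average_precision_score(y_test, proba))
print('log_loss:', log_loss(y_test, proba))
print(confusion_matrix(y_test, pred))   # [[TN FP], [FN TP]]
```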
I need help finding the best way to increase the true positives to around 60%, while tolerating false positives up to 3 times the number of actual positives.
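To make the target concrete: the test set has 103 + 14 = 117 actual positives, so I am aiming for roughly TP >= 70 (60% recall) while keeping FP <= 3 * 117 = 351. Below is a hedged sketch of one thing I can check, sweeping the decision threshold to see whether any operating point of the current model gets there (proba and y_test as in the evaluation sketch above):

```python
# A sketch of a threshold sweep over the predicted probabilities, looking
# for an operating point with recall >= 0.6 and FP <= 3x actual positives.
import numpy as np
from sklearn.metrics import confusion_matrix

for t in np.arange(0.05, 0.55, 0.05):
    tn, fp, fn, tp = confusion_matrix(y_test, (proba >= t).astype(int)).ravel()
    print(f'threshold={t:.2f}  TP={tp}  FP={fp}  recall={tp / (tp + fn):.2f}')
```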