
My training data is extremely class-imbalanced:

{0: 872525, 1: 3335}

with 100 features. I use XGBoost to build a classification model, with Bayesian optimisation to tune the hyperparameters over these ranges:

{ learning_rate: (0.001, 0.1),
  min_split_loss: (0, 10),
  max_depth: (3, 70),
  min_child_weight: (1, 20),
  max_delta_step: (1, 20),
  subsample: (0, 1),
  colsample_bytree: (0.5, 1),
  lambda: (0, 10),
  alpha: (0, 10),
  scale_pos_weight: (1, 262),
  n_estimators: (1, 20)
}

I also use binary:logistic as the objective and roc_auc as the metric, with booster gbtree (a minimal sketch of this setup is included below, after the results). The cross-validation score is 82.5%. However, when I applied the model to the test data, I got only

roc_auc: 75.2%,
pr_auc: 15%,
log_loss: 0.046

and confusion matrix:

[[19300     7]
 [  103    14]]
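
As mentioned above, here is a minimal sketch of the setup (I'm illustrating with the `bayes_opt` package; the exact optimisation package, the CV wiring, and the restriction to three of the ranges are simplifying assumptions, not my exact code):

```python
import xgboost as xgb
from bayes_opt import BayesianOptimization

# dtrain: an xgb.DMatrix built from the 100-feature training set (assumed in scope)
def cv_score(learning_rate, max_depth, scale_pos_weight):
    params = {
        "objective": "binary:logistic",
        "booster": "gbtree",
        "eval_metric": "auc",
        "learning_rate": learning_rate,
        "max_depth": int(max_depth),  # the optimiser proposes floats
        "scale_pos_weight": scale_pos_weight,
    }
    cv = xgb.cv(params, dtrain, num_boost_round=20, nfold=5,
                stratified=True, seed=42)
    return cv["test-auc-mean"].iloc[-1]  # mean CV ROC AUC to maximise

optimizer = BayesianOptimization(
    f=cv_score,
    pbounds={"learning_rate": (0.001, 0.1),
             "max_depth": (3, 70),
             "scale_pos_weight": (1, 262)},
    random_state=42,
)
optimizer.maximize(init_points=5, n_iter=25)
print(optimizer.max)  # best CV score and hyperparameters found
```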

I need help finding the best way to increase the true positives to around 60% of the actual positives, while tolerating false positives up to 3 times the number of actual positives.
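
For concreteness, the test set has 103 + 14 = 117 actual positives, so the target works out to roughly 70 true positives with up to about 350 false positives. Here is a minimal sketch of how the decision threshold could be swept to look for such an operating point (assuming `y_test` and the model's predicted positive-class probabilities `proba` are in scope; the threshold grid is arbitrary):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# proba = model.predict_proba(X_test)[:, 1]  -- positive-class probabilities
for threshold in np.arange(0.05, 0.55, 0.05):
    y_pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    print(f"threshold={threshold:.2f}  TP={tp}  FP={fp}  recall={tp / (tp + fn):.2f}")
```

Is lowering the threshold like this the right approach, or is there a better way?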

  • Please see "[ask]", "[Stack Overflow question checklist](https://meta.stackoverflow.com/questions/260648)" and all their linked pages, along with "[How do I format my posts...](https://stackoverflow.com/help/formatting)". Properly formatting your question helps us help you, and helps others understand what your question is about when they're looking for a solution also. Also read "[mre]". We need example code that runs and demonstrates the problem you're having. As is, it looks like you want us to write the code for you. – the Tin Man Mar 21 '22 at 20:54

1 Answer


You mentioned that your dataset is very imbalanced.

I'd recommend looking at "imbalanced-learn", which is

> a python package offering a number of re-sampling techniques commonly used in datasets showing strong between-class imbalance.

These techniques include, for example, over- and under-sampling (sketched below).
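
For instance, a minimal sketch using imbalanced-learn's samplers (the sampler choices and the 1:10 target ratio here are purely illustrative):

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# X, y: the original features and labels
# Under-sample the majority class to a 1:10 positive:negative ratio...
under = RandomUnderSampler(sampling_strategy=0.1, random_state=42)
X_under, y_under = under.fit_resample(X, y)

# ...or over-sample the minority class with synthetic examples (SMOTE)
over = SMOTE(sampling_strategy=0.1, random_state=42)
X_over, y_over = over.fit_resample(X, y)
```

Note that the resampled class ratio interacts with XGBoost's scale_pos_weight, so that search range would need revisiting after resampling.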

You can find out more in the full documentation and examples.

If you are working on this dataset in a company, you can also investigate getting more data or pruning your dataset using rules/heuristics.

  • I know about oversampling, but my data is now 750k rows with 320 features, so I'd prefer class weights to avoid adding more rows. Do you have any other ideas? https://datascience.stackexchange.com/questions/92776/high-recall-but-too-low-precision-result-in-imbalanced-data – zonna Apr 09 '21 at 07:57
  • Without having context about the business problem, it's a bit tough to give further recommendations. E.g., sometimes you can balance a dataset by reframing the problem, which could allow you to remove some instances based on rules. Inspecting your data is a great way to start. – neal Apr 10 '21 at 14:58
  • See "[Don't use "click here" as link text](https://www.w3.org/QA/Tips/noClickHere)". – the Tin Man Mar 21 '22 at 20:48