
I am having a lot of trouble training LightGBM on an extremely imbalanced dataset (using R).

I'm dealing with a binary classification problem, and the distribution of the target variable is about 1:800.

(Approximately: class 0: 110,000; class 1: 140)

I have nearly 300 variables (which are summaries of dynamic variables over 12 months) and a couple of categorical variables.

In everything that follows, I evaluate the model with the F1-score, and the metric I use during training is the binary log-loss.
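To make the distinction between the two quantities concrete, here is a minimal sketch in plain Python (not R, purely for illustration). The toy labels, the probabilities, and the 0.5 decision threshold are assumptions made for the example, not values from my data. The log-loss is computed on predicted probabilities, while the F1-score needs hard 0/1 predictions and therefore depends on the threshold.

```python
import math

def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1) from hard 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: F1 is 0 by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy computed on predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)

# Toy data (assumed, for illustration only).
y_true = [0, 0, 0, 0, 1, 1]
y_prob = [0.1, 0.2, 0.6, 0.3, 0.7, 0.4]

threshold = 0.5  # assumed cut-off; F1 changes if this changes
y_pred = [1 if p >= threshold else 0 for p in y_prob]

f1 = f1_score(y_true, y_pred)
loss = binary_log_loss(y_true, y_prob)
print(f"F1 = {f1:.3f}, log-loss = {loss:.3f}")
```

Because the F1-score depends on the cut-off, the same set of probabilities can give quite different F1 values at different thresholds.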

I have tried 2 approaches: one with resampling techniques, one without.

Method of 1st approach

  1. First, I label-encode my categorical variables (because ADASYN does not accept categorical variables as input)

  2. I have tried different combinations of SMOTE/ADASYN and NearMiss/RandomUnderSampler to resample my training set

  3. I standardize my numerical variables

  4. I train my model on the training set and predict on my validation set (without setting the scale_pos_weight parameter for the positive class in lgb.train)

  5. I obtain very bad results:
    On train set: F1-score=0.5
    On test set: F1-score=0.04
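The resampling part of the steps above can be sketched roughly as follows, in plain Python rather than R. This is a hand-written random undersampler used as a stand-in for RandomUnderSampler, not the library implementation. The function names, the toy labels, the 20% validation fraction, and the 10:1 target ratio are all assumptions made for the example. The point it illustrates is that only the training split is resampled, while the validation split keeps the original class ratio.

```python
import random

def stratified_split(labels, valid_fraction, rng):
    """Return (train_idx, valid_idx), keeping the class ratio in both parts."""
    train_idx, valid_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_valid = max(1, round(len(idx) * valid_fraction))
        valid_idx += idx[:n_valid]
        train_idx += idx[n_valid:]
    return train_idx, valid_idx

def undersample_majority(indices, labels, ratio, rng):
    """Keep every positive and at most `ratio` negatives per positive."""
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    n_keep = min(len(neg), ratio * len(pos))
    return pos + rng.sample(neg, n_keep)

rng = random.Random(0)

# Toy labels with roughly the 1:800 imbalance described above (assumed data).
labels = [0] * 8000 + [1] * 10

train_idx, valid_idx = stratified_split(labels, valid_fraction=0.2, rng=rng)

# Resample the training indices only; valid_idx is left untouched.
train_resampled = undersample_majority(train_idx, labels, ratio=10, rng=rng)

print(len(train_idx), len(train_resampled), len(valid_idx))
```

With this layout, the F1-score measured on the validation indices reflects the original imbalance, whereas the F1-score on the resampled training set does not.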

Method of 2nd approach

Same as the first one, but without applying any resampling to my training set.
I only set scale_pos_weight = count(negative)/count(positive) ≈ 800 in my case.
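For reference, the weight computation with the approximate class counts given above looks like this, written as a plain Python sketch. The dictionary only mirrors the LightGBM parameter names objective, metric and scale_pos_weight; it is an assumed, minimal configuration and not my full parameter set.

```python
# Approximate class counts from the description above.
n_negative = 110_000
n_positive = 140

# count(negative) / count(positive), i.e. roughly 786 (about 800).
scale_pos_weight = n_negative / n_positive

# Assumed minimal parameter set; any other parameters are omitted here.
params = {
    "objective": "binary",
    "metric": "binary_logloss",
    "scale_pos_weight": scale_pos_weight,
}

print(f"scale_pos_weight = {scale_pos_weight:.1f}")
```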

I have tried tuning the parameters, but I feel like I'm missing something, since the F1-score on the validation set is still around 0.02.

Do you have any idea on how I could improve my model?

Thanks a lot in advance for your help!

CCbs
  • if this is real world data, it's possible there is no explanatory power in your features – Nate Apr 28 '21 at 15:52
  • That's what I thought, but I did an exploratory analysis before implementing the model, and we can clearly see some trends in the observations that belong to my positive class. And yes, it is real-world data ^^ But you are right, it is possible that my features don't have enough explanatory power to predict on such an imbalanced dataset. Thanks Nate – CCbs Apr 29 '21 at 08:33
