I need to run a LightGBM model on an imbalanced dataset. The dataset has a binary 'Target' variable with 61471 records of class "0" and 4456 records of class "1". To mitigate the class imbalance, I ran the SMOTE function on the train dataset, as described below.
SMOTE function on the train dataset:

trainSMOTE <- SMOTE(target ~ ., train, perc.over = 400, k = 5)

dim(trainSMOTE)
57928 67

The dataset is balanced after the SMOTE function execution:

table(trainSMOTE$target)
    0     1
35648 22280
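
For reference, the surrounding setup looks roughly like this (only a sketch: the data frame name df, the seed and the 70/30 split are placeholders, and I am loading SMOTE from the DMwR package, whose signature matches the call above):

library(lightgbm)
library(DMwR)    # provides SMOTE(form, data, perc.over, k, ...)
library(caret)   # createDataPartition, used only for the stratified split

set.seed(123)                                                   # placeholder seed
idx   <- createDataPartition(df$target, p = 0.7, list = FALSE)  # placeholder 70/30 split
train <- df[idx, ]
test  <- df[-idx, ]

trainSMOTE <- SMOTE(target ~ ., train, perc.over = 400, k = 5)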
Problem: when I run the LightGBM model, the label ends up containing only ONE class ("Number of positive: 57928, number of negative: 0"). Below you can see the relevant part of the script.
Create the LightGBM datasets:

train_data <- lgb.Dataset(data.matrix(trainSMOTE[, -9]), label = trainSMOTE[, trainSMOTE$target])
test_data <- lgb.Dataset(data.matrix(test[, -9]), label = test[, test$target])
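
If it helps, I can also share the output of these quick checks on what is being passed into lgb.Dataset:

dim(data.matrix(trainSMOTE[, -9]))   # dimensions of the feature matrix
class(trainSMOTE$target)             # type of the target column after SMOTE
table(test$target)                   # class counts in the untouched test set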
Define the parameters:

params <- list(
  objective = "binary",
  metric = "auc",
  boosting_type = "gbdt",
  num_leaves = 100,
  learning_rate = 0.05,
  feature_fraction = 0.9,
  bagging_fraction = 0.8,
  bagging_freq = 5,
  min_data_in_leaf = 50,
  max_depth = -1,
  verbose = -1
)
Run the model:

model <- lgb.train(params = params,
                   data = train_data,
                   valids = list(test = test_data),
                   early_stopping_rounds = 50)
The run produces this output:

[LightGBM] [Warning] Contains only one class
[LightGBM] [Info] Number of positive: 57928, number of negative: 0
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.050367 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 15250
[LightGBM] [Info] Number of data points in the train set: 57928, number of used features: 65
[LightGBM] [Info] [binary:BoostFromScore]: pavg=1.000000 -> initscore=34.539576
[LightGBM] [Info] Start training from score 34.539576
[LightGBM] [Info] [binary:BoostFromScore]: pavg=1.000000 -> initscore=34.539576
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[1] "[1]: test's auc:0.5" ...
Regarding this result, I need to understand how to fix the problem: I am using a balanced dataset ("0" with 35648 and "1" with 22280, the output of the SMOTE function), but when I run the LightGBM model I get ONLY ONE CLASS (Number of positive: 57928, number of negative: 0), which turns it back into an imbalanced dataset.

I ran the SMOTE function to mitigate the imbalance problem and then trained the LightGBM model. I expected good results from this model because I was using a balanced dataset.
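
For completeness, this is roughly how I plan to evaluate the model once training sees both classes (only a sketch: I am assuming the pROC package for the AUC, that column 9 of test holds the target, and a 0.5 threshold for hard class predictions):

pred <- predict(model, data.matrix(test[, -9]))            # predicted probability of class "1"

library(pROC)                                              # assumed only for computing the AUC
roc_obj <- roc(as.numeric(as.character(test$target)), pred)
auc(roc_obj)

pred_class <- ifelse(pred > 0.5, 1, 0)                     # hard predictions at a 0.5 threshold
table(Predicted = pred_class, Actual = test$target)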