I need to run a LightGBM model on an imbalanced dataset. The dataset has a binary 'Target' variable with 61471 records of class "0" and 4456 records of class "1". To mitigate the class imbalance, I ran the SMOTE function on the train dataset, as described below.
SMOTE function on the train dataset:

trainSMOTE <- SMOTE(target ~ ., train, perc.over = 400, k = 5)

dim(trainSMOTE)
57928 67

The dataset is balanced after the SMOTE function execution:

table(trainSMOTE$target)
    0     1
35648 22280
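
For reference, the surrounding setup looks roughly like this (only a sketch: the data frame name df, the seed and the 70/30 split are placeholders, and I am loading SMOTE from the DMwR package, whose signature matches the call above):

library(lightgbm)
library(DMwR)    # provides SMOTE(form, data, perc.over, k, ...)
library(caret)   # createDataPartition, used only for the stratified split

set.seed(123)                                                   # placeholder seed
idx   <- createDataPartition(df$target, p = 0.7, list = FALSE)  # placeholder 70/30 split
train <- df[idx, ]
test  <- df[-idx, ]

trainSMOTE <- SMOTE(target ~ ., train, perc.over = 400, k = 5)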
Problem: when I run the LightGBM model, the label ends up containing only ONE class ("Number of positive: 57928, number of negative: 0"). Below you can see the relevant part of the script.
Create the LightGBM datasets:

train_data <- lgb.Dataset(data.matrix(trainSMOTE[, -9]), label = trainSMOTE[, trainSMOTE$target])
test_data <- lgb.Dataset(data.matrix(test[, -9]), label = test[, test$target])
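
If it helps, I can also share the output of these quick checks on what is being passed into lgb.Dataset:

dim(data.matrix(trainSMOTE[, -9]))   # dimensions of the feature matrix
class(trainSMOTE$target)             # type of the target column after SMOTE
table(test$target)                   # class counts in the untouched test set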
Define the parameters:

params <- list(
  objective = "binary",
  metric = "auc",
  boosting_type = "gbdt",
  num_leaves = 100,
  learning_rate = 0.05,
  feature_fraction = 0.9,
  bagging_fraction = 0.8,
  bagging_freq = 5,
  min_data_in_leaf = 50,
  max_depth = -1,
  verbose = -1
)
Run the model:

model <- lgb.train(params = params,
                   data = train_data,
                   valids = list(test = test_data),
                   early_stopping_rounds = 50)
The run produces this output:

[LightGBM] [Warning] Contains only one class
[LightGBM] [Info] Number of positive: 57928, number of negative: 0
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.050367 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 15250
[LightGBM] [Info] Number of data points in the train set: 57928, number of used features: 65
[LightGBM] [Info] [binary:BoostFromScore]: pavg=1.000000 -> initscore=34.539576
[LightGBM] [Info] Start training from score 34.539576
[LightGBM] [Info] [binary:BoostFromScore]: pavg=1.000000 -> initscore=34.539576
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[1] "[1]: test's auc:0.5" ...
Regarding this result, I need to understand how to fix the problem: I am using a balanced dataset ("0" with 35648 and "1" with 22280, the output of the SMOTE function), but when I run the LightGBM model I get ONLY ONE CLASS (Number of positive: 57928, number of negative: 0), which turns it back into an imbalanced dataset.

I ran the SMOTE function to mitigate the imbalance problem and then trained the LightGBM model. I expected good results from this model because I was using a balanced dataset.
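
For completeness, this is roughly how I plan to evaluate the model once training sees both classes (only a sketch: I am assuming the pROC package for the AUC, that column 9 of test holds the target, and a 0.5 threshold for hard class predictions):

pred <- predict(model, data.matrix(test[, -9]))            # predicted probability of class "1"

library(pROC)                                              # assumed only for computing the AUC
roc_obj <- roc(as.numeric(as.character(test$target)), pred)
auc(roc_obj)

pred_class <- ifelse(pred > 0.5, 1, 0)                     # hard predictions at a 0.5 threshold
table(Predicted = pred_class, Actual = test$target)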