Training difference between LightGBM API and Sklearn API

Question

I'm trying to train an LGBMClassifier for a multiclass task. I first worked directly with the LightGBM API and set up the model and training as follows:

LightGBM API

train_data = lgb.Dataset(X_train, (y_train-1))
test_data = lgb.Dataset(X_test, (y_test-1))
params = {}
params['learning_rate'] = 0.3
params['boosting_type'] = 'gbdt'
params['objective'] = 'multiclass'
params['metric'] = 'softmax'
params['max_depth'] = 10
params['num_class'] = 8
params['num_leaves'] = 500

lgb_train = lgb.train(params, train_data, 200)

# AFTER TRAINING THE MODEL

y_pred = lgb_train.predict(X_test)
y_pred_class = [np.argmax(line) for line in y_pred]
y_pred_class = np.asarray(y_pred_class) + 1

This is how the confusion matrix looks:

Sklearn API

Then I tried moving to the Sklearn API so that I could use other tools. This is the code I used:

lgb_clf = LGBMClassifier(objective='multiclass',
    boosting_type='gbdt',
    max_depth=10,
    num_leaves=500,
    learning_rate=0.3,
    eval_metric=['accuracy','softmax'],
    num_class=8,
    n_jobs=-1,
    early_stopping_rounds=100,
    num_iterations=500)

clf_train = lgb_clf.fit(X_train, (y_train-1), verbose=1, eval_set=[(X_train, (y_train-1)), (X_test, (y_test-1))])

# TRAINING:  I can see overfitting is happening

y_pred = clf_train.predict(X_test)
y_pred = [np.argmax(line) for line in y_pred]
y_pred = np.asarray(y_pred) + 1

And this is the confusion matrix in this case:

Notes

I need to subtract 1 from y_train because my classes start at 1 and LightGBM was complaining about this.
When I try a RandomSearch or a GridSearch, I always obtain the same result as the last confusion matrix.
I have checked different questions here, but none of them solves this issue.
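
The shift by 1 in the first note can also be expressed with scikit-learn's LabelEncoder, which maps label values to contiguous integers starting at 0 and back again. A small sketch, assuming the labels run from 1 to 8 as in the question; the arrays `y_raw` and `pred_encoded` are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up labels that start at 1, as in the question.
y_raw = np.array([1, 3, 8, 2, 8, 5, 4, 6, 7, 1])

le = LabelEncoder()
y_encoded = le.fit_transform(y_raw)   # contiguous integers starting at 0

# Hypothetical predictions in the encoded space (e.g. an argmax result).
pred_encoded = np.array([0, 7, 2])

# Map predictions back to the original label values.
pred_raw = le.inverse_transform(pred_encoded)

print(y_encoded)  # [0 2 7 1 7 4 3 5 6 0]
print(pred_raw)   # [1 8 3]
```

This only matches a plain "minus 1" when every class from 1 to 8 actually appears in the labels passed to fit_transform; otherwise the encoder assigns a different, more compact numbering.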

Questions

Is there anything that I'm missing when implementing the model in the Sklearn API?
Why do I obtain good results (maybe with overfitting) with LightGBM API?
How can I achieve the same results with the two APIs?

Thanks in advance.

UPDATE: It was my mistake. I thought the output of both APIs would be the same, but that is not the case. I just removed the np.argmax() line when predicting with the Sklearn API, since that API's predict() already returns the class directly. Please don't remove the question, in case someone else is dealing with a similar issue.
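
A minimal sketch of the mismatch described in the update, using only NumPy and made-up numbers rather than a trained model. It assumes that the native Booster.predict returns one row of class probabilities per sample for a multiclass objective, and that LGBMClassifier.predict returns a 1-D array of class labels; the arrays `proba` and `labels` are hypothetical stand-ins for those two outputs:

```python
import numpy as np

# Hypothetical stand-in for Booster.predict(X_test): one row of
# class probabilities per sample (3 samples, 4 classes).
proba = np.array([
    [0.1, 0.7, 0.1, 0.1],
    [0.2, 0.1, 0.1, 0.6],
    [0.5, 0.2, 0.2, 0.1],
])

# Hypothetical stand-in for LGBMClassifier.predict(X_test): the labels
# themselves (here, the argmax of each probability row above).
labels = np.array([1, 3, 0])

# Correct for the probability matrix: pick the most likely class per row.
from_proba = np.asarray([np.argmax(row) for row in proba]) + 1

# Wrong for the label array: each "row" is a single number, and the
# argmax of a single number is always index 0.
from_labels_wrong = np.asarray([np.argmax(v) for v in labels]) + 1

# Right for the label array: only undo the earlier "minus 1" shift.
from_labels_right = labels + 1

print(from_proba)         # [2 4 1]
print(from_labels_wrong)  # [1 1 1]
print(from_labels_right)  # [2 4 1]
```

Under these assumptions, applying np.argmax to an array of labels assigns every sample to the first class, which would produce a degenerate confusion matrix no matter how the model was tuned.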

score 0 · Answer 1 · answered Mar 07 '23 at 16:57

I used train_test_split and lasso to find the significant features for predicting a thyroid class with a multiclass (softmax) objective. I achieved 96 percent accuracy.

import pandas as pd
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
import matplotlib.pyplot as plt
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score

df=pd.read_csv("thyroid.csv")


le=LabelEncoder()
df["Classes"]=le.fit_transform(df["Classes"])
df["Sex"]=le.fit_transform(df["Sex"])
for col in df.columns:
    if df[col].dtype == 'object':
        # Map the remaining 'f'/'t' flag columns to 0/1
        df[col] = df[col].map({'f': 0, 't': 1})

# Make sure the numeric measurements are floats
df["TSH"]=df["TSH"].astype(float)
df["T3"]=df["T3"].astype(float)
df["TT4"]=df["TT4"].astype(float)
df["T4U"]=df["T4U"].astype(float)
df["FTI"]=df["FTI"].astype(float)

sns.scatterplot(x='TSH',y='T3',hue='Classes', data=df)
plt.title("TSH and T3")
plt.show()

sns.scatterplot(x='TSH',y='TT4',hue='Classes', data=df)
plt.title("TSH and TT4")
plt.show()

sns.scatterplot(x='TSH',y='T4U',hue='Classes', data=df)
plt.title("TSH and T4U")
plt.show()

sns.scatterplot(x='TSH',y='FTI',hue='Classes', data=df)
plt.title("TSH and FTI")
plt.show()

x_columns=["TSH","T3","T4U","FTI"]
target="Classes"
X=df[x_columns]
y=df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Define the classifier before fitting it
lgb_clf = LGBMClassifier(objective='multiclass',
    boosting_type='gbdt',
    max_depth=10,
    num_leaves=500,
    learning_rate=0.3,
    eval_metric=['accuracy','softmax'],
    num_class=3,
    n_jobs=1,
    #early_stopping_rounds=100,
    num_iterations=500)

lgb_clf.fit(X_train, y_train)
# predict() returns class labels directly, so no np.argmax is needed
y_pred = lgb_clf.predict(X_test)

# scikit-learn metrics take (y_true, y_pred) in that order
print(accuracy_score(y_test, y_pred))
cm=confusion_matrix(y_test, y_pred)
sns.heatmap(cm,annot=True,cmap=plt.cm.Blues)
plt.show()

score 0 · Answer 2 · answered Mar 28 '23 at 07:25

To eliminate the difference between the LightGBM (lgb) API and the Sklearn (lgb.sklearn) API, you can try the following steps:

Compare the models trained through the lgb API and the lgb.sklearn API. Pay special attention to hyperparameters that end up with different values (e.g., n_estimators, bagging_seed, and so on) and set them to the same value in your code.
Make sure the format and content of your data are valid; different APIs have different prerequisites for the training data.

With the same hyperparameters and training data, both APIs call the same C++ code (each API is just a wrapper), so you'll get the same model from either one.
