
I am trying to create a simple model in LightGBM using two features; one is categorical and the other is a distance. I am following a tutorial (https://sefiks.com/2018/10/13/a-gentle-introduction-to-lightgbm-for-applied-machine-learning/) which states that even after label encoding, I still need to tell LightGBM that my encoded feature is categorical in nature. However, I get this series of warning messages when I try to do so:

UserWarning: Using categorical_feature in Dataset.
UserWarning: categorical_feature in Dataset is overridden. New categorical_feature is ['type']
UserWarning: categorical_feature in param dict is overridden.

What I'm wondering is whether LightGBM does in fact understand that the column is categorical. It seems like it does, but then I'm not sure why the tutorial explicitly states that it doesn't. Below is the code I have:

import pandas as pd
import lightgbm as lgb
import sklearn.preprocessing as prep
import sklearn.model_selection as mls

trainDataProc = pd.read_csv('trainDataPrepared.csv', header=0)

le=prep.LabelEncoder()

# Label-encode every object (string) column in place.
for column_name in trainDataProc.columns:
    if trainDataProc[column_name].dtype == 'object':
        trainDataProc[column_name] = le.fit_transform(trainDataProc[column_name])

# Prepare train X and Y column names.
trainColumnsX = ['type', 'dist']
cat_feat=['type']
trainColumnsY = ['scalar']

# Perform K-fold split.
kfold = mls.KFold(n_splits=5, shuffle=True, random_state=0)
result = next(kfold.split(trainDataProc), None)
train = trainDataProc.iloc[result[0]]
test = trainDataProc.iloc[result[1]]

# Train model via lightGBM.
lgbTrain = lgb.Dataset(train[trainColumnsX], label=train[trainColumnsY], 
                       categorical_feature=cat_feat)
lgbEval = lgb.Dataset(test[trainColumnsX], label=test[trainColumnsY],
                      reference=lgbTrain)

# Model parameters.
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'mae'},
    'num_leaves': 25,
    'learning_rate': 0.0001,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

# Set up training.
gbm = lgb.train(params,
                lgbTrain,
                num_boost_round=200,
                valid_sets=lgbEval,
                early_stopping_rounds=50)
lwang94

3 Answers


I was also facing a similar warning message and looked at the LightGBM documentation.

It's not required to declare the categorical features separately if you are using a pandas DataFrame with categorical columns (label-encoded as integers and given the 'category' dtype).

The default value of LightGBM's 'categorical_feature' parameter is 'auto', which ensures that pandas categorical columns are used automatically.
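A minimal sketch with made-up data to illustrate this (column names and values are hypothetical): casting the column to the pandas 'category' dtype is enough for the default 'auto' setting to pick it up, without passing categorical_feature at all.

import pandas as pd
import lightgbm as lgb

# Made-up toy frame: 'type' is given the pandas 'category' dtype.
df = pd.DataFrame({
    'type': pd.Series(['a', 'b', 'a', 'c'] * 25, dtype='category'),
    'dist': range(100),
})
label = [float(i % 3) for i in range(100)]

# No categorical_feature argument: the default 'auto' detects the
# 'category'-dtype column and treats it as categorical.
train_set = lgb.Dataset(df, label=label)
booster = lgb.train({'objective': 'regression', 'verbose': -1},
                    train_set, num_boost_round=10)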

Deepak

The reason why you should still tell LightGBM that the encoded features are categorical is that otherwise the model sees a numerical variable: it will try to split it with greater-than/less-than comparisons against a threshold, which is not correct for a categorical variable that has no inherent order.
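As a sketch with random, made-up data, this is what passing that hint looks like: declaring the column through categorical_feature makes LightGBM partition the category codes at each split instead of comparing them against a threshold.

import numpy as np
import pandas as pd
import lightgbm as lgb

rng = np.random.default_rng(0)

# Made-up data: 'type' holds label-encoded codes whose order means nothing.
X = pd.DataFrame({'type': rng.integers(0, 4, size=200),
                  'dist': rng.random(200)})
y = rng.random(200)

# Declaring the column categorical makes LightGBM group category values
# at each split rather than threshold the integer codes.
train_set = lgb.Dataset(X, label=y, categorical_feature=['type'])
booster = lgb.train({'objective': 'regression', 'verbose': -1},
                    train_set, num_boost_round=10)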

Andres A

Even if you do label encoding, all it does is map the values to integers in some arbitrary order. The column still contains numbers, so LightGBM will try to split on that variable as if it were a continuous feature.
So we need to provide that column name explicitly so that LightGBM knows it needs to treat that variable differently.
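A quick illustration with made-up values: the codes LabelEncoder assigns are just alphabetical ranks, so any ordering they imply is an artifact of the encoding.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Classes are sorted alphabetically: bus -> 0, car -> 1, train -> 2.
print(le.fit_transform(['car', 'bus', 'train', 'bus']))  # [1 0 2 0]
# The implied 'bus' < 'car' < 'train' order is meaningless, which is why
# the column must still be flagged via categorical_feature.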

MHK