I am trying to create a simple model in LightGBM using two features: one is categorical and the other is a distance. I am following a tutorial (https://sefiks.com/2018/10/13/a-gentle-introduction-to-lightgbm-for-applied-machine-learning/) which states that even after label encoding, I still need to tell LightGBM that my encoded feature is categorical in nature. However, I get this series of warning messages when I try to do so:
UserWarning: Using categorical_feature in Dataset.
  warnings.warn('Using categorical_feature in Dataset.')
UserWarning: categorical_feature in Dataset is overridden.
New categorical_feature is ['type']
  warnings.warn('New categorical_feature is {}'.format(sorted(list(categorical_feature))))
UserWarning: categorical_feature in param dict is overridden.
  warnings.warn('categorical_feature in param dict is overridden.')
What I'm wondering is whether LightGBM does in fact understand that the column is categorical. It seems like it does, but then I'm not sure why the tutorial explicitly states that it won't unless told.
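From what I can tell, the warnings seem to fire because categorical_feature reaches LightGBM through more than one path (my explicit Dataset argument plus the categorical_feature='auto' default that lgb.train passes along), and the explicit Dataset value wins. For comparison, here is a minimal sketch (toy data made up by me; the column names just mirror my real ones) of the alternative route I have read about, where the column is cast to pandas 'category' dtype so that the default 'auto' mode detects it without any explicit declaration:

import pandas as pd
import lightgbm as lgb

# Toy frame standing in for my real data.
toy = pd.DataFrame({'type': ['a', 'b', 'a', 'c'],
                    'dist': [1.0, 2.5, 0.3, 4.1],
                    'scalar': [0.1, 0.4, 0.2, 0.9]})
# Casting to 'category' replaces the LabelEncoder step entirely;
# Dataset's default categorical_feature='auto' then picks the column up.
toy['type'] = toy['type'].astype('category')
toyDs = lgb.Dataset(toy[['type', 'dist']], label=toy['scalar'])

Below is the full code I have: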
import pandas as pd
import lightgbm as lgb
import sklearn.preprocessing as prep
import sklearn.model_selection as mls

trainDataProc = pd.read_csv('trainDataPrepared.csv', header=0)

# Label-encode every object-typed column in place.
le = prep.LabelEncoder()
num_columns = trainDataProc.shape[1]
for i in range(0, num_columns):
    column_name = trainDataProc.columns[i]
    column_type = trainDataProc[column_name].dtypes
    if column_type == 'object':
        le.fit(trainDataProc[column_name])
        encoded_feature = le.transform(trainDataProc[column_name])
        trainDataProc[column_name] = pd.DataFrame(encoded_feature)
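As a sanity check I added on my own (not part of the tutorial), I verify that no object-typed columns remain after the loop:

# 'type' should now be an integer column rather than object.
print(trainDataProc.dtypes)
assert not (trainDataProc.dtypes == 'object').any()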
# Prepare train X and Y column names.
trainColumnsX = ['type', 'dist']
cat_feat = ['type']
trainColumnsY = ['scalar']
# Perform K-fold split; take only the first fold's train/test indices.
kfold = mls.KFold(n_splits=5, shuffle=True, random_state=0)
result = next(kfold.split(trainDataProc), None)
train = trainDataProc.iloc[result[0]]
test = trainDataProc.iloc[result[1]]
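I'm aware this takes only the first fold. A full cross-validation loop, sketched here with hypothetical fold_train/fold_test names rather than what I actually run, would iterate over all five splits:

for train_idx, test_idx in kfold.split(trainDataProc):
    fold_train = trainDataProc.iloc[train_idx]
    fold_test = trainDataProc.iloc[test_idx]
    # ...build the Datasets below once per fold and train a model on each...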
# Train model via LightGBM.
lgbTrain = lgb.Dataset(train[trainColumnsX], label=train[trainColumnsY],
                       categorical_feature=cat_feat)
# The validation set should reference the train set so it reuses its bin mappers.
lgbEval = lgb.Dataset(test[trainColumnsX], label=test[trainColumnsY],
                      reference=lgbTrain)
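One extra check I tried (my own addition; it relies on the Dataset object keeping a public categorical_feature attribute, which it does on the version I have):

# Shows which columns the Dataset registered as categorical.
print(lgbTrain.categorical_feature)  # -> ['type'] for me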
# Model parameters.
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'mae'},
    'num_leaves': 25,
    'learning_rate': 0.0001,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}
# Set up training.
gbm = lgb.train(params,
                lgbTrain,
                num_boost_round=200,
                valid_sets=lgbEval,
                early_stopping_rounds=50)
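For completeness, this is how I evaluate on the held-out fold afterwards (mean_absolute_error comes from sklearn.metrics; gbm.best_iteration is populated once early stopping triggers):

from sklearn.metrics import mean_absolute_error

# Predict with the best iteration found by early stopping.
pred = gbm.predict(test[trainColumnsX], num_iteration=gbm.best_iteration)
print('MAE on held-out fold:', mean_absolute_error(test[trainColumnsY], pred))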