
I want to use LightGBM to predict the tradeMoney of houses, but I run into trouble when I specify categorical_feature in lgb.Dataset.
The dtypes of my data are as follows:

type(train)
pandas.core.frame.DataFrame

train.dtypes
area                  float64
rentType               object
houseFloor             object
totalFloor              int64
houseToward            object
houseDecoration        object
region                 object
plate                  object
buildYear               int64
saleSecHouseNum         int64
subwayStationNum        int64
busStationNum           int64
interSchoolNum          int64
schoolNum               int64
privateSchoolNum        int64
hospitalNum             int64
drugStoreNum            int64

And I use LightGBM to train it as follows:

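import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score

categorical_feats = ['rentType', 'houseFloor', 'houseToward', 'houseDecoration', 'region', 'plate']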
folds = KFold(n_splits=5, shuffle=True, random_state=2333)

oof_lgb = np.zeros(len(train))
predictions_lgb = np.zeros(len(test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train.values, target.values)):
    print("fold {}".format(fold_))
    trn_data = lgb.Dataset(train.iloc[trn_idx], label=target.iloc[trn_idx], categorical_feature=categorical_feats)
    val_data = lgb.Dataset(train.iloc[val_idx], label=target.iloc[val_idx], categorical_feature=categorical_feats)

    num_round = 10000
    clf = lgb.train(params, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=500, early_stopping_rounds = 200)

    oof_lgb[val_idx] = clf.predict(train.iloc[val_idx], num_iteration=clf.best_iteration)

    predictions_lgb += clf.predict(test, num_iteration=clf.best_iteration) / folds.n_splits

print("CV Score: {:<8.5f}".format(r2_score(target, oof_lgb)))

But it still raises the following error, even though I have specified categorical_feature:

ValueError: DataFrame.dtypes for data must be int, float or bool. Did not expect the data types in fields rentType, houseFloor, houseToward, houseDecoration, region, plate

Here is my environment:

LightGBM version: 2.2.3
Pandas version: 0.24.2
Python version: 3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]

Could anyone help me, please?

Bowen Peng

2 Answers


The problem is that LightGBM can only handle pandas columns of `category` dtype as categorical features, not `object` columns. In LightGBM's pandas handling code, the columns of `category` dtype are collected and encoded as integers. Nothing is done to `object` columns, so LightGBM complains when it finds that not all features have been converted to numbers.

So the solution is to run

for c in categorical_feats:
    train[c] = train[c].astype('category')

before your CV loop.
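The question also calls `clf.predict(test, ...)`, so the same conversion presumably has to be applied to `test` as well. A minimal sketch, assuming toy stand-ins for `train` and `test` (the data below is hypothetical) and reusing the training dtype so that both frames share one set of categories:

```python
import pandas as pd

# Hypothetical toy frames standing in for the question's `train` and `test`.
train = pd.DataFrame({
    "area": [50.0, 80.5, 62.0],
    "rentType": ["whole", "shared", "whole"],
    "houseFloor": ["low", "mid", "high"],
})
test = pd.DataFrame({
    "area": [45.0, 90.0],
    "rentType": ["shared", "whole"],
    "houseFloor": ["high", "low"],
})

categorical_feats = ["rentType", "houseFloor"]

for c in categorical_feats:
    # Convert the object column to pandas' category dtype.
    train[c] = train[c].astype("category")
    # Reuse the training dtype so test gets the same category list.
    test[c] = test[c].astype(train[c].dtype)

print(train.dtypes)
print(test.dtypes)
```

After this, neither frame contains `object` columns, which is what the error message in the question complains about.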

Mischa Lisovyi
  • Maybe you didn't use the Dataset API. Note: You should convert your categorical features to int type before you construct `Dataset`. – Mithril May 07 '21 at 10:19
  • @Mithril LightGBM can recognize the pandas `category` dtype automatically. – lovetl2002 May 23 '23 at 09:27
  • How would the data need to be passed at inference time then? As a string or as an int? If the latter, how could we get the string -> int mapping? – 3nomis Aug 16 '23 at 14:30
  • @3nomis if you use a `category` feature in the training data, then the same feature has to be `category` at the inference stage, and the code will take care of the `category` -> `int` conversion internally (see the code linked in the original answer) – Mischa Lisovyi Aug 24 '23 at 19:18
  • Looking at the linked code, the list of possible categories is forced to match the one from the training data using `data[col].cat.set_categories(...)`, which means, according to the [set_categories docs](https://pandas.pydata.org/docs/reference/api/pandas.Series.cat.set_categories.html), that any new category at inference time will be set to NaN – Mischa Lisovyi Aug 24 '23 at 19:27
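
The `set_categories` behaviour described in the last comment can be checked with pandas alone. This is only a sketch of the pandas side, with made-up values; it does not call LightGBM:

```python
import pandas as pd

# Hypothetical training and inference columns, both of category dtype.
train_col = pd.Series(["south", "east", "south"], dtype="category")
infer_col = pd.Series(["east", "north"], dtype="category")

# Force the inference column onto the training categories.
aligned = infer_col.cat.set_categories(train_col.cat.categories)

# "north" was never seen in training, so it becomes NaN (integer code -1).
print(aligned.isna().tolist())
print(aligned.cat.codes.tolist())
```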

You should convert your categorical features to int type before you construct the Dataset. You will find this information at https://lightgbm.readthedocs.io/en/latest/Python-Intro.html. I had cases with a mix of categorical and integer features that produced the same error. The solution was to convert all categorical features to int.
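
One possible way to do that conversion is sketched below, assuming a single hypothetical column and `pd.factorize` as the encoder (any consistent string -> int mapping would do). The mapping is learned on the training data and kept, so the same codes can be applied at inference time. Unseen values are encoded as -1 here, on the assumption that LightGBM treats negative values in categorical features as missing, as its documentation states.

```python
import pandas as pd

# Hypothetical toy data; "W" appears only at inference time.
train = pd.DataFrame({"houseToward": ["S", "N", "S", "E"]})
test = pd.DataFrame({"houseToward": ["N", "W"]})

# Learn the string -> int mapping from the training column.
codes, uniques = pd.factorize(train["houseToward"])
mapping = {value: i for i, value in enumerate(uniques)}
train["houseToward"] = codes

# Apply the same mapping to test; values not in the mapping become -1.
test["houseToward"] = test["houseToward"].map(mapping).fillna(-1).astype("int64")

print(mapping)
print(train["houseToward"].tolist())
print(test["houseToward"].tolist())
```

The encoded columns can still be listed in `categorical_feature` so that LightGBM treats the integers as categories rather than as ordered numbers.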

Femosh