Trouble training xgboost on categorical column

Question

I am trying to run a Python notebook (link). At the cell below, In [446], where the author trains XGBoost, I get this error:

ValueError: DataFrame.dtypes for data must be int, float or bool. Did not expect the data types in fields StateHoliday, Assortment

# XGB with xgboost library
dtrain = xgb.DMatrix(X_train[predictors], y_train)
dtest = xgb.DMatrix(X_test[predictors], y_test)

watchlist = [(dtrain, 'train'), (dtest, 'test')]

xgb_model = xgb.train(params, dtrain, 300, evals = watchlist,
                      early_stopping_rounds = 50, feval = rmspe_xg, verbose_eval = True)

Here is the minimal code for testing

import pickle
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

with open('train_store', 'rb') as f:
    train_store = pickle.load(f)

train_store.shape

predictors = ['Store', 'DayOfWeek', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'Year', 'Month', 'Day', 
              'WeekOfYear', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 
              'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'CompetitionOpen', 
              'PromoOpen']

y = np.log(train_store.Sales) # log transformation of Sales
X = train_store

# split the data into train/test set
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.3, # 30% for the evaluation set
                                                    random_state = 42)

# base parameters
params = {
    'booster': 'gbtree', 
    'objective': 'reg:linear', # regression task
    'subsample': 0.8,          # 80% of data to grow trees and prevent overfitting
    'colsample_bytree': 0.85,  # 85% of features used
    'eta': 0.1, 
    'max_depth': 10, 
    'seed': 42} # for reproducible results

num_round = 60 # default 300

dtrain = xgb.DMatrix(X_train[predictors], y_train)
dtest  = xgb.DMatrix(X_test[predictors],  y_test)

watchlist = [(dtrain, 'train'), (dtest, 'test')]

xgb_model = xgb.train(params, dtrain, num_round, evals = watchlist,
                      early_stopping_rounds = 50, feval = rmspe_xg, verbose_eval = True)

Link to train_store data file: Link 1
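Note that `rmspe_xg` is not defined in the snippet above, so it cannot run as posted. The following is only a guess at what the notebook defines: a root mean square percentage error metric in the `(predictions, DMatrix)` form that `feval` expects. The helper name `rmspe` and the assumption that labels are log-transformed (as with `y = np.log(train_store.Sales)` above) are mine, not the notebook author's.

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean square percentage error (hypothetical helper)."""
    return np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2))

def rmspe_xg(y_pred, dtrain):
    """Custom metric for xgb.train: returns a (name, value) pair.

    Assumes the labels stored in dtrain are log(Sales), so both
    labels and predictions are mapped back with exp() first.
    """
    y_true = np.exp(dtrain.get_label())
    y_pred = np.exp(y_pred)
    return "rmspe", rmspe(y_true, y_pred)
```

If the labels were not log-transformed, the two `np.exp` calls would have to be dropped.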

This isn't Minimal, as in MCVE. Do you mean the `StateHoliday` column is a categorical? If yes please say so in the question. — smci, Mar 06 '20 at 12:58

score 16 · Answer 1 · answered Sep 10 '19 at 02:28

I ran into exactly the same issue while working on the Rossmann Sales Prediction project. It seems that newer versions of xgboost do not accept the data types of StateHoliday, Assortment, and StoreType. You can check the data types, as Mykhailo Lisovyi suggested, with

print(test_train.dtypes)

Replace test_train here with your X_train.

You might get something like:

DayOfWeek                      int64
Promo                          int64
StateHoliday                   int64
SchoolHoliday                  int64
StoreType                     object
Assortment                    object
CompetitionDistance          float64
CompetitionOpenSinceMonth    float64
CompetitionOpenSinceYear     float64
Promo2                         int64
Promo2SinceWeek              float64
Promo2SinceYear              float64
Year                           int64
Month                          int64
Day                            int64

The error is caused by the object-typed columns. You can convert them with

from sklearn import preprocessing
lbl = preprocessing.LabelEncoder()
test_train['StoreType'] = lbl.fit_transform(test_train['StoreType'].astype(str))
test_train['Assortment'] = lbl.fit_transform(test_train['Assortment'].astype(str))

After these steps, training should run without the error.

score 9 · Accepted Answer · answered Nov 11 '19 at 12:03

Try this

train_store['StateHoliday'] = pd.to_numeric(train_store['StateHoliday'])
train_store['Assortment'] = pd.to_numeric(train_store['Assortment'])
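As a hedged illustration of what `pd.to_numeric` does here: it parses values that already look like numbers, and by default raises if a value cannot be parsed. The toy series below are made up; they are not taken from the train_store file.

```python
import pandas as pd

# Object column holding numeric strings: parsed into an integer dtype.
numeric_strings = pd.Series(["0", "1", "0"], dtype=object)
as_int = pd.to_numeric(numeric_strings)

# Object column holding a non-numeric label: the default raises ValueError.
mixed = pd.Series(["0", "a", "0"], dtype=object)
try:
    pd.to_numeric(mixed)
    raised = False
except ValueError:
    raised = True

# errors="coerce" turns unparseable values into NaN instead of raising.
coerced = pd.to_numeric(mixed, errors="coerce")
```

So this approach only fits columns whose values are numbers stored as strings; columns with real labels need one of the encodings discussed in the other answers.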

Atinesh

If you want to use a trained model in production and need to apply **the same** encoding for test samples in the future, you must use another way of encoding, for example scikit Transformers as shown by Zhi Yuan in his answer, so that the transform can be saved together with the model. Running pd.to_numeric() on new data will likely result in a **different** mapping than you used originally during training! – Marcin Wojnarski Dec 14 '20 at 13:13
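A minimal sketch of the idea in this comment, using plain pandas as a stand-in for a scikit-learn transformer: the label-to-code mapping is learned once from the training data and then reused unchanged on new data. The function names `fit_category_maps` and `apply_category_maps` and the toy frames are hypothetical.

```python
import pandas as pd

def fit_category_maps(df, columns):
    """Record the sorted unique labels of each column seen during training."""
    return {col: sorted(df[col].astype(str).unique()) for col in columns}

def apply_category_maps(df, category_maps):
    """Encode columns as integer codes using the stored training labels."""
    out = df.copy()
    for col, labels in category_maps.items():
        cat = pd.Categorical(out[col].astype(str), categories=labels)
        out[col] = cat.codes  # labels not seen in training become -1
    return out

train = pd.DataFrame({"StoreType": ["a", "c", "b", "a"],
                      "Assortment": ["a", "c", "c", "b"]})
new = pd.DataFrame({"StoreType": ["c", "d"],
                    "Assortment": ["b", "a"]})

maps = fit_category_maps(train, ["StoreType", "Assortment"])
train_enc = apply_category_maps(train, maps)
new_enc = apply_category_maps(new, maps)  # same codes as in training
```

Because `maps` is an ordinary dict of lists, it can be saved next to the model (for example with pickle or JSON) and loaded again at prediction time.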

score 0 · Answer 3 · answered May 12 '19 at 08:30

As the error message suggests, xgboost is unhappy that you are feeding it unsupported types. It says that it cannot deal with categorical or datetime features. Check the types of the StateHoliday and Assortment features and encode them as numbers in some way (for example one-hot encoding, label encoding (works for tree-based models), or target encoding).
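A small sketch of that check-then-encode step, assuming one-hot encoding is acceptable for the columns involved. The toy frame is made up and only borrows column names from the question.

```python
import pandas as pd

X = pd.DataFrame({
    "Store": [1, 2, 3],
    "StateHoliday": ["0", "a", "0"],
    "Assortment": ["a", "c", "b"],
    "CompetitionDistance": [1270.0, 570.0, 14130.0],
})

# Columns whose dtype is neither numeric nor boolean.
bad_cols = X.select_dtypes(exclude=["number", "bool"]).columns.tolist()

# One-hot encode only those columns; the rest pass through unchanged.
X_encoded = pd.get_dummies(X, columns=bad_cols, dtype=int)
```

One-hot encoding adds a column per distinct label, so for columns with many labels one of the other encodings may be a better fit.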

Mischa Lisovyi

I have checked the datatype; it's `int` – arush1836 May 12 '19 at 11:48
Could you please check from the full stack trace which of the commands causes the `ValueError` and add the `.dtypes` dump for that dataframe to the original question? The origin of the problem is in a type, that is not supported by xgboost. Supported types are listed here in the code: https://github.com/dmlc/xgboost/blob/e7d17ec4f4a091bac58c1d241be3f4969400b874/python-package/xgboost/core.py#L220 – Mischa Lisovyi May 12 '19 at 12:15

score -1 · Answer 4 · answered Feb 19 '20 at 20:51

The XGBoost version in the H2O package can handle categorical variables (but not too many!) but it appears that XGBoost as its own package can't.

I tried this with pandas dataframes, but xgboost didn't like it:

categoricals = ['StoreType', ]  # etc.
pdf[categoricals] = pdf[categoricals].astype('category')

To use H2O with categoricals, you will have to convert strings to categoricals first:

h2odf[categoricals] = h2odf[categoricals].asfactor()

Note, too, that H2O has its own dataframes, which are different from pandas dataframes.

Clem Wang

This is inaccurate. XGBoost can handle categorical on its own. https://xgboost.readthedocs.io/en/latest/tutorials/categorical.html?highlight=enable_categorical#using-native-interface – DataDog Nov 04 '21 at 12:55
Not exactly... I do not consider "One-Hot-Encoding" a satisfactory way of handling categorical variables, because it dilutes the feature space without adding any more information. https://xgboost.readthedocs.io/en/latest/tutorials/categorical.html says "At the moment, the support is implemented as one-hot encoding based categorical tree splits." – Clem Wang Jan 27 '22 at 06:17
Your opinion on whether it's satisfactory isn't really the point. It can handle them on its own; you simply don't like how it's done. That is up for debate, but your original statement that the XGBoost library can't handle categorical values by itself is false. – DataDog Jan 28 '22 at 14:47
