Using Catboost Classifier to convert categorical columns

Question

I'm trying to apply CatBoost to one of my columns as a categorical feature, but I get the following error:

CatBoostError: Invalid type for cat_feature[non-default value idx=0,feature_idx=2]=68892500.0 : cat_features must be integer or string, real number values and NaN values should be converted to string.

I could use one-hot encoding, but many people here say CatBoost is better at handling categorical features and makes the model less prone to overfitting.

My data consists of three columns: 'Country', 'year', and 'phone users'. The target is 'Country'; 'year' and 'phone users' are the features.

Data:

Country   year   phone users
Ireland   1989   978
France    1990   854
Spain     1991   882
Turkey    1992   457
...       ...    ...

My code so far:

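from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

# Named X_pool / y_pool to match the variables used below
X_pool = df.loc[115:305]
y_pool = df.loc[80:, 0]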

cat_features = list(range(0, X_pool.shape[1]))
# Output: [0, 1, 2]

X_train, X_val, y_train, y_val = train_test_split(
    X_pool, y_pool, test_size=0.2, random_state=0)

cbc = CatBoostClassifier(iterations=5, learning_rate=0.1)

cbc.fit(X_train, y_train, eval_set=(X_val, y_val),
        cat_features=cat_features, verbose=False)

print("Model Evaluation Stage")

Do I need to run LabelEncoder before fitting the CatBoost model? What am I missing here?

Flavia Giammarino · Accepted Answer · 2021-05-23T17:41:06.117

As stated in the error message included in your question, all categorical features must be integers or strings; real-number values such as 68892500.0 are rejected. To cast 'phone users' (or any other data frame column) to string, you can use df['phone users'] = df['phone users'].astype(str).
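
As a small illustration of the cast, here is a sketch on a toy data frame shaped like the table in the question. The values are copied from that table, but storing 'phone users' as floats is an assumption made to mirror the float in the error message.

```python
import pandas as pd

# Toy frame modelled on the question's table; float storage is an assumption.
df = pd.DataFrame({
    "Country": ["Ireland", "France", "Spain", "Turkey"],
    "year": [1989, 1990, 1991, 1992],
    "phone users": [978.0, 854.0, 882.0, 457.0],
})

# Floats are not accepted as categorical values, so cast the column to str.
df["phone users"] = df["phone users"].astype(str)

# Every value is now a Python string such as "978.0".
first_value = df["phone users"].iloc[0]
```

Note that a float column keeps its decimal part when cast this way ("978.0"); casting to int first, with .astype(int).astype(str), would give "978" instead, provided the column has no missing values.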

CatBoost will then internally encode each categorical feature using either one-hot encoding or target encoding, depending on the number of unique values the feature takes. There is no need to encode the categorical features beforehand with LabelEncoder or OneHotEncoder; see the CatBoost documentation for more details.

edited May 23 '21 at 17:41

answered Apr 13 '21 at 17:56


I converted all columns to strings except `age, distance, delay` since they are int64. Of course I get an error because catboost wants strings. So do I drop these three columns or do I actually convert them to strings even though they are not categorical features? – Edison Jun 30 '22 at 03:59
