3

I want to pass the 12th column of a NumPy array as a categorical feature.

The column has int values from 1 to 10.

I tried this:

cbr.fit(X_train, y,
        eval_set=(X_train_test, y_test),
        cat_features=[X_train[:,12]],
        use_best_model=True,
        verbose=100)

But got this error:

CatboostError: 'data' is numpy array of np.float32, it means no categorical features, but 'cat_features' parameter specifies nonzero number of categorical features

John Doe

2 Answers

7

Categorical features cannot be float values. The reason is that categorical features are treated as strings, and we must get the same string whether the feature value is read from a file or from a dataframe. We cannot guarantee that for float values, but we can for strings and integers.
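A small sketch of why floats make unreliable category keys, assuming NumPy is available. The variable names are illustrative, and this only mimics a value-to-string conversion; it is not CatBoost's internal code. The same stored number renders as different strings depending on the float width, while an integer renders the same way every time.

```python
import numpy as np

# The same number, stored as 32-bit and then widened to 64-bit.
as_float32 = np.float32(0.1)
as_float64 = np.float64(as_float32)

# Both hold the identical value, yet their string forms differ,
# so a string-based lookup would see two different categories.
float_keys = (str(as_float32), str(as_float64))
print(float_keys)

# Integers have one canonical string form regardless of width.
int_keys = (str(np.int32(7)), str(np.int64(7)))
print(int_keys)
```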

To solve your problem, use a dataframe in which the columns holding categorical features are of integer or string type.

For example,

from catboost import CatBoostClassifier, Pool
import pandas as pd

data = pd.DataFrame({'string_column': ['val0', 'val1', 'val2'],
                     'int_column': [1,2,3],
                     'float_column': [1.2,2,4.1]})
print(data)
print(data.dtypes)

train_data = Pool(
    data=data,
    label=[1, 1, -1],
    weight=[0.1, 0.2, 0.3],
    cat_features=[0, 1]
)

model = CatBoostClassifier(iterations = 10)
model.fit(X=train_data)
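Applied to the float array from the question, the same idea might look like the sketch below. It assumes that cbr is a CatBoostRegressor and that the array has at least 13 columns; the helper names, the CAT_COL constant, and the random stand-in data are hypothetical and not part of the original post.

```python
import numpy as np
import pandas as pd

CAT_COL = 12  # zero-based position used in the question (hypothetical constant name)


def to_frame_with_int_categories(X, cat_cols):
    """Wrap a float array in a DataFrame and cast the given columns to int."""
    frame = pd.DataFrame(X)
    for col in cat_cols:
        frame[col] = frame[col].astype(int)
    return frame


def fit_regressor(train_frame, y_train, eval_frame, y_eval, cat_cols):
    """Train a CatBoostRegressor with the integer columns marked as categorical."""
    from catboost import CatBoostRegressor, Pool

    train_pool = Pool(data=train_frame, label=y_train, cat_features=cat_cols)
    eval_pool = Pool(data=eval_frame, label=y_eval, cat_features=cat_cols)
    model = CatBoostRegressor(iterations=10)
    model.fit(train_pool, eval_set=eval_pool, use_best_model=True, verbose=100)
    return model


# Stand-in data: 13 float columns, with column 12 holding the values 1..10.
rng = np.random.default_rng(0)
X_train = rng.random((20, 13)).astype(np.float32)
X_train[:, CAT_COL] = rng.integers(1, 11, size=20)

train_frame = to_frame_with_int_categories(X_train, [CAT_COL])
print(train_frame.dtypes.tail(3))
```

Note that cat_features takes column indices (or names), not the column values themselves, and that the indices are zero-based: index 12 is the 13th column, so the 12th column would be index 11.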
  • 2
    That didn't exactly do the trick for me but it led me on the right path. For future readers: Reading the examples provided in the documentation helped a lot: https://catboost.ai/docs/concepts/python-usages-examples.html – Hagbard Feb 24 '20 at 12:15
-1

You cannot use categorical features in CatBoost with a plain float NumPy array.

The reason is that a NumPy array has a single dtype for the whole array (float here), while CatBoost requires categorical features to be of type int. Mixing dtypes in such an array is not possible. You can build a dataframe instead and make sure that its dtypes are correct.

df = df.astype(dtype={
    'cat_feature1':int,
    ...
})

From there you could do this:

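# Split the frame by dtype so the integer columns are converted on their own.
df_int_list = df.select_dtypes(include='int').values.tolist()
df_no_int_list = df.select_dtypes(exclude='int').values.tolist()

# Rebuild each row with the integer (categorical) values first.
df_list = [ints + rest for ints, rest in zip(df_int_list, df_no_int_list)]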

This works because DataFrame.values converts the selection to a NumPy array and tolist() then converts that array to a nested list. When the selection holds only integer columns, the array has an integer dtype, so the list entries stay integers.
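To make the dtype behaviour concrete, here is a minimal sketch with a made-up two-column frame (the column names and values are illustrative only). It compares converting the whole frame at once with converting the integer columns separately.

```python
import pandas as pd

df = pd.DataFrame({
    "cat_feature1": [1, 2, 3],        # integer-coded category
    "num_feature1": [0.5, 1.5, 2.5],  # ordinary float feature
})

# Converting the whole frame at once upcasts every value to float.
mixed_rows = df.values.tolist()

# Converting the integer columns on their own keeps Python ints.
int_rows = df.select_dtypes(include="int").values.tolist()
float_rows = df.select_dtypes(exclude="int").values.tolist()
combined_rows = [ints + floats for ints, floats in zip(int_rows, float_rows)]

print(type(mixed_rows[0][0]), type(combined_rows[0][0]))
```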

from catboost import CatBoostClassifier, Pool

# The integer columns come first in every row, so they occupy indices 0..k-1.
cat_features = list(range(len(df_int_list[0])))
train_data = Pool(
    data=df_list,  # ensure your target values are removed
    label=...,  # insert your target values
    cat_features=cat_features
)

model = CatBoostClassifier()
model.fit(X=train_data)
kodkirurg