CatBoost and UnicodeEncodeError

Question

I'm using CatBoostRegressor with pandas on Python 2.7, but I get the following error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)

I use a unicode sandwich and read the CSV with df = pd.read_csv('out.csv', index_col=0, encoding='utf8'). After reading the data, I check the inferred column types:

print df.apply(lambda x: pd.lib.infer_dtype(x.values))

node        integer
name        unicode
region      unicode
price      floating
hour        integer
year        integer
month       integer
day         integer
dtype: object

Apparently CatBoost tries to encode the strings itself and fails. How can this be avoided?

Simplified code:

import pandas as pd
from catboost import CatBoostRegressor


lst2 = [100001,100002,100003,100004,100005]
lst3 = [u'Хлеб',u'Молоко',u'Чай',u'Кофеёк',u'Пончики']
lst4 = [100.0,200.1,100.0,3.5,200.0]
lst5 = [876.0,185.1023,101.12698,301.5023,200.0]
lst6 = [1,1,1,1,1]

df = pd.DataFrame({u'node' : lst2, u'name':lst3, u'vol':lst4, u'price':lst5, u'hour':lst6},
                  columns=[u'node', u'name', u'vol', u'price', u'hour'])

train_data = df.iloc[:-2, :]
train_labels = train_data[u'price'].values
train_data = train_data.drop([u'price'], axis = 1)


cat_features = [1]
clf = CatBoostRegressor(iterations=100, learning_rate=0.1, depth=4)
clf.fit(train_data, train_labels, cat_features)

Use `enc = lambda x: x.encode('utf8'); df[u'name_cat_col1'] = df[u'name_cat_col1'].map(enc); df[u'name_cat_col2'] = df[u'name_cat_col2'].map(enc)` - hyper, Sep 11 '18 at 06:25
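
The comment's idea can be sketched without pandas, assuming the goal is to hand CatBoost UTF-8 byte strings instead of unicode objects on Python 2.7. The helper name to_utf8_bytes is hypothetical (it is not part of pandas or CatBoost), and the sample values are taken from the question:

```python
# -*- coding: utf-8 -*-

def to_utf8_bytes(value):
    """Encode text values as UTF-8 bytes; return anything else unchanged.

    Hypothetical helper for illustration, not a pandas or CatBoost function.
    """
    text_type = type(u'')  # unicode on Python 2, str on Python 3
    if isinstance(value, text_type):
        return value.encode('utf-8')
    return value


# Sample values from the question's 'name' column.
names = [u'Хлеб', u'Молоко', u'Чай']

# Each unicode value becomes a UTF-8 encoded byte string.
encoded = [to_utf8_bytes(n) for n in names]

# Non-text values, such as prices, pass through untouched.
price = to_utf8_bytes(100.0)
```

With the DataFrame from the question, the same helper could be applied as df[u'name'] = df[u'name'].map(to_utf8_bytes) before calling fit. Whether this is enough may depend on the CatBoost version in use.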

score 0 · Answer 1 · answered Sep 29 '18 at 16:22

This looks like an issue with Cython. In your case it may help to replace the unicode strings with byte strings, like this: lst3 = [b'Хлеб',...

Dmitry Baksheev
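
The error message from the question can be reproduced without CatBoost. That suggests, as an assumption rather than something confirmed from CatBoost's source, that a unicode value is being encoded with the ASCII codec somewhere along the way. A minimal sketch of that failure and of the explicit UTF-8 alternative:

```python
# -*- coding: utf-8 -*-

name = u'Хлеб'

# Encoding non-ASCII text with the ASCII codec fails. Python 2 does this
# implicitly whenever a unicode object is coerced to a byte string.
try:
    name.encode('ascii')
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    reason = exc.reason  # the trailing part of the error message

# Encoding explicitly as UTF-8 succeeds and yields a byte string.
as_bytes = name.encode('utf-8')
```

Encoding the values explicitly before they reach the library avoids relying on any implicit ASCII conversion.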