4

I have a question regarding random forests. Imagine that I have data on users interacting with items. The number of items is large, around 10 000. The output of the random forest should be the items that the user is likely to interact with (like a recommender system). For any user, I want to use a feature that describes the items the user has interacted with in the past. However, mapping the categorical product feature as a one-hot encoding seems very memory inefficient, since a user interacts with no more than a couple of hundred items at most, and sometimes as few as 5.

How would you go about constructing a random forest when one of the input features is a categorical variable with ~10 000 possible values and the output is a categorical variable with ~10 000 possible values? Should I use CatBoost with the features as categorical? Or should I use one-hot encoding, and if so, do you think XGBoost or CatBoost does better?
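(For the memory concern above: a multi-hot encoding stored as a sparse matrix only keeps the interacted item ids, so the ~10 000 columns never have to be materialised densely. A minimal sketch with scipy, using made-up interaction histories:)

```python
import numpy as np
from scipy import sparse

n_items = 10_000
# Hypothetical interaction histories: item ids each user has touched.
histories = [
    [3, 17, 4242],           # user 0 interacted with 3 items
    [5, 17, 99, 9001, 123],  # user 1 interacted with 5 items
]

rows = [u for u, items in enumerate(histories) for _ in items]
cols = [i for items in histories for i in items]
data = np.ones(len(cols), dtype=np.int8)

X = sparse.csr_matrix((data, (rows, cols)), shape=(len(histories), n_items))
print(X.shape)  # (2, 10000)
print(X.nnz)    # 8 stored entries instead of 20 000 dense cells
```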

digestivee

3 Answers

3

You could also try entity embeddings to reduce hundreds of boolean features into vectors of small dimension.

It is similar to word embeddings, but for categorical features. In practical terms, you define an embedding of your discrete feature space into a low-dimensional vector space. It can improve your results and save memory. The downside is that you do need to train a neural network model beforehand to learn the embedding.
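A minimal sketch of the idea, assuming the per-item embedding matrix has already been trained (e.g. by a small neural network): represent each user by the average of the vectors of the items they interacted with, and feed that dense vector to the forest.

```python
import numpy as np

n_items, dim = 10_000, 16
rng = np.random.default_rng(0)
# Stand-in for a trained embedding matrix (one 16-d vector per item);
# in practice these weights come from the neural network mentioned above.
item_vectors = rng.normal(size=(n_items, dim))

def user_features(interacted_item_ids):
    """Average the embeddings of the items a user interacted with."""
    return item_vectors[interacted_item_ids].mean(axis=0)

x = user_features([3, 17, 4242])
print(x.shape)  # (16,) -- a dense input vector for the random forest
```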

Check this article for more information.

Dany Majard
    Yeah, it's been a long time since I wrote the question and I've learnt a lot since then, but I think this would be my approach if I went back to this problem. Embeddings are truly amazing in my book – digestivee Apr 27 '18 at 09:39
1

XGBoost doesn't support categorical features directly; you need to preprocess them before using it, for example with one-hot encoding. One-hot encoding usually works well when some values of your categorical feature are frequent.
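For example, scikit-learn's `OneHotEncoder` returns a scipy sparse matrix, which XGBoost can consume directly, so the 10 000 columns stay cheap. A sketch with made-up item ids:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one categorical column of item ids.
X = np.array([["item_3"], ["item_17"], ["item_3"], ["item_42"]])

enc = OneHotEncoder(handle_unknown="ignore")  # sparse output by default
X_enc = enc.fit_transform(X)

print(X_enc.shape)  # (4, 3) -- one column per distinct value seen
print(X_enc.nnz)    # 4 stored entries, one per row
```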

CatBoost does have categorical feature support: both one-hot encoding and the calculation of different statistics on categorical features. To use one-hot encoding you need to enable it with the one_hot_max_size parameter; by default, statistics are calculated. Statistics usually work better for categorical features with many values.
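The statistics CatBoost computes are variants of target statistics. A hand-rolled sketch of a smoothed mean encoding shows the idea (this is not CatBoost's exact ordered scheme, which also guards against target leakage):

```python
from collections import defaultdict

def mean_target_encode(categories, targets, prior_weight=10.0):
    """Replace each category value by a smoothed mean of its targets.

    Smoothing pulls rare categories toward the global mean, which is
    why statistics scale better than one-hot for high-cardinality
    features: every category becomes a single numeric column.
    """
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return [
        (sums[c] + prior_weight * global_mean) / (counts[c] + prior_weight)
        for c in categories
    ]

encoded = mean_target_encode(["a", "a", "b", "c"], [1, 1, 0, 1])
print(encoded)  # rare categories sit close to the global mean of 0.75
```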

0

Assuming you have enough domain expertise, you could create a new categorical column from an existing one. For example, if your column has the values below

A,B,C,D,E,F,G,H

and you know that A,B,C are similar, D,E,F are similar, and G,H are similar, your new column would be

Z,Z,Z,Y,Y,Y,X,X

In your random forest model you should remove the previous column and include only this new column. By transforming your features like this you lose some of the explainability of your model.
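In code this is just a lookup table; a sketch using the hypothetical groups above:

```python
# Domain-knowledge grouping of similar category values.
group = {"A": "Z", "B": "Z", "C": "Z",
         "D": "Y", "E": "Y", "F": "Y",
         "G": "X", "H": "X"}

column = ["A", "B", "C", "D", "E", "F", "G", "H"]
new_column = [group[v] for v in column]
print(new_column)  # ['Z', 'Z', 'Z', 'Y', 'Y', 'Y', 'X', 'X']
```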

prudhvi Indana