
sklearn 0.24.2, python = 3.8

While trying to use a decision tree regressor in sklearn, I've come across a common problem. I've read lots of questions, but none of them has a definitive answer.

I want to handle a categorical (non-ordinal, high-cardinality) column. However, using:

  • OrdinalEncoder assigns an order such as 1 < 2 < 3, and so on, which is a problem since my column does not have any order.
  • OneHotEncoder leads to high dimensionality, which in turn leads to deeper trees and more computation.
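To make the trade-off above concrete, here is a minimal sketch of what the two encoders produce. The `cities` column and its values are made up for illustration; only `OrdinalEncoder` and `OneHotEncoder` come from sklearn.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical nominal column: 6 rows, 5 distinct values, no natural order.
cities = np.array([["paris"], ["tokyo"], ["lima"], ["oslo"], ["cairo"], ["lima"]])

# OrdinalEncoder maps each category to a single integer code, so a tree
# split like "code <= 2" groups categories by an arbitrary ordering.
ordinal = OrdinalEncoder().fit_transform(cities)

# OneHotEncoder creates one column per distinct category, so the width
# grows with the cardinality of the column.
one_hot = OneHotEncoder().fit_transform(cities).toarray()

print(ordinal.shape)  # one column regardless of cardinality
print(one_hot.shape)  # one column per distinct category
```

With a real high-cardinality column, the one-hot matrix would have as many columns as there are distinct values, which is the dimensionality problem described above.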

Is there no way to do this with sklearn? If not, what other libraries are recommended?

The question "Can sklearn DecisionTreeClassifier truly work with categorical data?" says I could use BinaryEncoding; however, this seems to be meant for the case where cardinality = 2.

Also, R's library(tree) handles categorical columns without any preprocessing. How can this be done in a similar way in Python?

haneulkim
  • I think that's just the cost of using categorical data with decision trees. You might want to consider encoding the categories into a lower-dimensional space with something like feature hashing. There's a pretty nice description here: https://towardsdatascience.com/understanding-feature-engineering-part-2-categorical-data-f54324193e63 – Andrew Wei Aug 31 '21 at 02:53
  • But in R, it handles categorical data w/o any data preprocessing. Wondering if there is a library that allows this in python. – haneulkim Aug 31 '21 at 02:54
  • Do you mind giving the name of the function/library that does this so we have a better idea of what you're looking for? – Andrew Wei Aug 31 '21 at 02:55
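For reference, the feature hashing idea mentioned in the comments could be sketched roughly as follows with sklearn's `FeatureHasher`. The category values and the choice of `n_features=8` are assumptions for illustration, not something taken from the thread.

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical high-cardinality column (only a few values shown).
categories = ["paris", "tokyo", "lima", "oslo", "cairo", "lima"]

# Hash every category into a fixed number of columns. The width is set
# by n_features, not by the number of distinct categories.
hasher = FeatureHasher(n_features=8, input_type="string")

# With input_type="string", each sample is an iterable of strings,
# so each value is wrapped in a one-element list.
hashed = hasher.transform([[c] for c in categories]).toarray()

print(hashed.shape)  # width stays at n_features
```

The width stays fixed no matter how many distinct values appear, at the cost of possible collisions where different categories land in the same column.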
