I have a dataset with 200+ categorical (non-ordinal) variables and just a few continuous variables. I tried one-hot encoding, but that increases the dimensionality considerably and results in a poor score.
It seems that the regular scikit-learn decision tree can only handle non-ordinal categorical variables after they have been one-hot encoded, and I was wondering if there is a way to build a tree without one-hot encoding. I did some research and found a library called h2o that might be useful, but I'm trying to find a way to run it on my local machine.
4

welo121
-
in what framework? please be specific (why the `h2o` tag?) – desertnaut Jul 05 '19 at 17:47
-
My bad, please see my edits – welo121 Jul 05 '19 at 17:51
-
Good question - I really don't like R, but it does seem that the rpart() feature in R deals with categorical variables much more elegantly than in Python. e.g. in section 1.2 of this article https://www.kaggle.com/floser/glm-neural-nets-and-xgboost-for-insurance-pricing/comments – langbourne May 10 '21 at 16:55
2 Answers
5
You can install the h2o-3 package for Python, for example from h2o.ai/downloads or from PyPI.
The h2o package handles categorical values automatically and efficiently. It is recommended not to one-hot encode them first.
You can find lots of documentation at docs.h2o.ai.
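As a rough illustration of what this looks like in practice, here is a minimal sketch that trains a tree-based model on raw categorical columns. The function name, its arguments, the choice of a random forest, and the hyperparameters are assumptions made for this example, not something prescribed above. It also assumes a classification task and that `h2o` (plus a Java runtime) is installed.

```python
def train_forest_on_raw_categoricals(csv_path, target, categorical_cols):
    """Train an H2O random forest without one-hot encoding (illustrative sketch)."""
    # Imported inside the function so this sketch can be loaded
    # even on a machine where h2o is not installed yet.
    import h2o
    from h2o.estimators import H2ORandomForestEstimator

    # Start (or connect to) an H2O instance; by default this is on localhost.
    h2o.init()

    # Load the CSV into an H2OFrame held by that H2O instance.
    frame = h2o.import_file(csv_path)

    # Mark the categorical columns as factors (H2O's "enum" type) so the
    # tree algorithm can split on them directly, with no one-hot step.
    for col in categorical_cols:
        frame[col] = frame[col].asfactor()

    # Assumption: classification, so the target is converted to a factor too.
    frame[target] = frame[target].asfactor()

    predictors = [c for c in frame.columns if c != target]

    # Hypothetical hyperparameters, chosen only for the example.
    model = H2ORandomForestEstimator(ntrees=50, seed=1)
    model.train(x=predictors, y=target, training_frame=frame)
    return model
```

A call might look like `train_forest_on_raw_categoricals("data.csv", "label", ["color", "region"])`, where the file name and column names are placeholders.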

TomKraljevic
-
Will my data stay on my local machine, though, or be sent to an h2o server? I couldn't figure out how the API works (whether it keeps my data or gets access to it) – welo121 Jul 05 '19 at 18:24
-
@welo121 When you run `h2o.init()` the H2O instance will be initialized wherever you run this command (like your local machine). Then when you move data into this H2O instance it will still be on your local machine. – mtanco Jul 05 '19 at 18:34
-
So the data will never leave my machine for a server or something? I have to make sure because it's confidential data – welo121 Jul 05 '19 at 18:39
-
@welo121 It is a server/client architecture, but by default both server and client are on localhost, so your data will not leave the machine. – Darren Cook Jul 06 '19 at 15:21
-
If the h2o.init() "ip" parameter is localhost, then the data is not leaving your machine (the data is sent to localhost). If the "ip" parameter is not localhost, then the data will leave your machine (and be sent to the machine you named in the "ip" parameter). – TomKraljevic Jul 06 '19 at 21:35
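For confidential data, the point made in these comments can be turned into an explicit guard. The sketch below is only one possible way to do it: the helper names `resolves_to_loopback` and `init_h2o_locally` are hypothetical, and the check uses only the Python standard library. It refuses to call `h2o.init()` unless the given host resolves exclusively to loopback addresses.

```python
import ipaddress
import socket


def resolves_to_loopback(host):
    """Return True only if every address `host` resolves to is a loopback address."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        # An unresolvable name is treated as "not local".
        return False
    # Strip any IPv6 zone suffix (e.g. "%lo0") before parsing the address.
    addresses = {info[4][0].split("%")[0] for info in infos}
    return bool(addresses) and all(
        ipaddress.ip_address(addr).is_loopback for addr in addresses
    )


def init_h2o_locally(host="localhost", port=54321):
    """Start or connect to H2O, but only if `host` is this machine (sketch)."""
    if not resolves_to_loopback(host):
        raise ValueError(f"{host!r} is not a loopback address; refusing to connect")
    # Imported late so the check above works even without h2o installed.
    import h2o
    h2o.init(ip=host, port=port)
```

Here `54321` is H2O's usual default port; passing a different `port` is fine as long as `host` stays on the loopback interface.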
0
As per https://datascience.stackexchange.com/a/32623/51879:
You can use other encoding techniques via this scikit-learn-compatible wrapper: http://contrib.scikit-learn.org/categorical-encoding/
Also check out this great article for more details: https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931
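To give a feel for what an alternative encoding does, here is a small pure-Python sketch of smoothed target (mean) encoding, which replaces each category with a single number instead of one column per level. This is an illustration of the general idea only: it is not the API of the wrapper linked above, and the function names and the `smoothing` parameter are assumptions made for this example.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=10.0):
    """Learn a category -> smoothed mean-target mapping (illustrative sketch)."""
    if not categories or len(categories) != len(targets):
        raise ValueError("categories and targets must be non-empty and equal in length")

    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1

    # Blend each category's mean with the global mean; rare categories
    # are pulled more strongly toward the global mean.
    mapping = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return mapping, global_mean


def apply_target_encoding(categories, mapping, default):
    """Encode categories, using `default` for levels not seen during fitting."""
    return [mapping.get(cat, default) for cat in categories]
```

The mapping should be fitted on training data only (ideally within cross-validation folds), since it uses the target and can otherwise leak label information into the features.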

Ankush Chauhan