I'm trying to understand the pros and cons of the encoding options available in H2O through the 'categorical_encoding' parameter, and when to use each one.

It would be helpful if people could point out general rules of thumb on how to use this.

Typically I use the 'Enum' value because I like how all levels of a categorical variable are grouped together when looking at feature importance. On the other hand, I believe XGBoost's default value is 'label-encoder', which breaks things up by categorical level/value.

Unfortunately, I don't really know where to begin, or what questions to ask, about the other available values:

  • one_hot_internal
  • one_hot_explicit
  • sort_by_response
  • enum_limited
  • enum
  • label_encoder

Again, I primarily stick with enum and sometimes use label-encoder, but honestly I don't know the practical implications of these options. I would love a general explanation, from someone knowledgeable, of when one might be better than another!

runningbirds
  • Hi @runningbirds, since this isn't a coding-specific question, it would be great if you could post it to Stack Exchange's Cross Validated: https://stats.stackexchange.com/questions/tagged/h2o – Lauren Nov 09 '18 at 03:20
  • Thanks so much for posting this question to the correct location! – Lauren Nov 10 '18 at 01:12

1 Answer

As requested (thanks!), this question was reposted to Cross Validated, so the answer on the pros and cons can be found at: https://stats.stackexchange.com/questions/376203/categorical-encoding-in-h2o-what-is-the-difference-between-the-options
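
For readers who want a quick feel for what a few of these encodings do to a column, here is a toy sketch in plain Python. It is not H2O's implementation: the function names, the sample data, and the tie-breaking rule are all assumptions made for illustration. It only mimics the general idea of label encoding (level to integer index), one-hot encoding (one 0/1 column per level), and a sort-by-response style encoding (levels numbered by increasing mean response).

```python
from collections import defaultdict


def label_encode(values):
    """Map each level to its index in the alphabetically sorted level list."""
    levels = sorted(set(values))
    index = {level: i for i, level in enumerate(levels)}
    return [index[v] for v in values], levels


def one_hot_encode(values):
    """Return one 0/1 indicator column per level, for each row."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return rows, levels


def sort_by_response_encode(values, response):
    """Number the levels by increasing mean response (ties broken by name)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for v, y in zip(values, response):
        totals[v] += y
        counts[v] += 1
    levels = sorted(counts, key=lambda lv: (totals[lv] / counts[lv], lv))
    index = {level: i for i, level in enumerate(levels)}
    return [index[v] for v in values], levels


# Hypothetical toy data: one categorical column and a numeric response.
colors = ["red", "green", "blue", "green", "red", "red"]
target = [1.0, 0.0, 0.5, 0.0, 1.0, 0.0]

label_codes, label_levels = label_encode(colors)
one_hot_rows, one_hot_levels = one_hot_encode(colors)
sbr_codes, sbr_levels = sort_by_response_encode(colors, target)

print("label:", label_levels, label_codes)
print("one-hot:", one_hot_levels, one_hot_rows)
print("sort-by-response:", sbr_levels, sbr_codes)
```

The sketch hints at the practical trade-off raised in the question: an integer code keeps a variable as a single feature but imposes an order on its levels, while one-hot encoding produces a separate feature per level, which is typically why feature importance ends up split by level.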

Lauren