
I've got a rather small dataset (162,000 observations with 13 attributes) that I'm trying to use for modelling with h2o.gbm. The response variable is categorical with a large number of levels (~20,000). The model doesn't run out of memory or give any errors, but it ran for nearly 24 hours without any progress (the H2O GBM progress report stayed at 0%), so I finally gave in and stopped it. I'm wondering if there's anything wrong with my hyperparameters, as the data is not particularly large.

Here's my code:

library(h2o)

# Start a local H2O cluster on all available cores with 12 GB of heap
localH2O <- h2o.init(nthreads = -1, max_mem_size = "12g")

# Import the R data frame into H2O
train.h20 <- as.h2o(analdata_train)

gbm1 <- h2o.gbm(
    y = response_var
  , x = independ_vars
  , training_frame = train.h20
  , ntrees = 3
  , max_depth = 5
  , min_rows = 10
  , stopping_tolerance = 0.001  # note: has no effect unless stopping_rounds > 0
  , learn_rate = 0.1
  , distribution = "multinomial"
)
Ankhnesmerira

2 Answers


The way H2O GBM multinomial classification works is that when you ask for 1 tree as a parameter, it actually builds a tree for each level in the response column under the hood.

So 1 tree really means 20,000 trees in your case.

2 trees would really mean 40,000, and so on...

(Note the binomial classification case takes a shortcut and builds only one tree for both classes.)

So... it will probably finish but it could take quite a long time!
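As a quick sanity check on the scale, you can count the levels in base R (a minimal sketch, assuming the analdata_train and response_var objects from the question, with the response stored as a factor):

# Back-of-the-envelope tree count for this dataset (names taken from the question)
n_classes <- nlevels(analdata_train[[response_var]])  # ~20,000 levels
n_classes * 3  # with ntrees = 3, roughly 60,000 individual trees get built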

TomKraljevic

It's probably not a good idea to train a classifier with 20,000 classes; most GBM implementations won't even let you do that. Can you group or cluster the classes into a smaller number of groups so that you can train a model with fewer classes? If so, you could perform your training in a two-stage process: the first model would have K classes (assuming you clustered your classes into K groups), and then you could train secondary models that further classify the observations into your original classes.

This type of two-stage process may make sense if your classes represent groups that naturally cluster into a hierarchy, such as zip codes or ICD-10 medical diagnostic codes, for example.
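Here is a minimal sketch of that two-stage idea in h2o, assuming a hypothetical class_group column that already maps the ~20,000 classes into K coarse groups (built via clustering or domain knowledge; it is not part of the original data):

# Stage 1: predict the coarse group (K classes instead of ~20,000);
# "class_group" is an assumed, pre-computed column, not from the question
stage1 <- h2o.gbm(
    y = "class_group"
  , x = independ_vars
  , training_frame = train.h20
  , distribution = "multinomial"
)

# Stage 2: one model per group, trained only on that group's rows,
# predicting the original fine-grained class
groups <- h2o.levels(train.h20[["class_group"]])
stage2 <- lapply(groups, function(g) {
  rows_g <- train.h20[train.h20[["class_group"]] == g, ]
  h2o.gbm(
      y = response_var
    , x = independ_vars
    , training_frame = rows_g
    , distribution = "multinomial"
  )
})

At prediction time you would route each row through the stage-1 model first and then apply the matching stage-2 model.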

If your use case really demands that you train a 20,000-class GBM (and there's no way around it), then you should get a bigger cluster of machines for H2O (it's unclear how many CPUs you are using currently). H2O GBM should be able to finish training, assuming it has enough memory and CPUs, but it may take a while.
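To see what your current setup actually has, and to attach R to a bigger cluster rather than starting a local one (the host name below is a placeholder, not a real address):

# Report nodes, cores, and free memory of the running H2O cluster
h2o.clusterInfo()

# Connect to an existing multi-node H2O cluster instead of launching a local JVM
localH2O <- h2o.init(ip = "h2o-master.example.com", port = 54321, startH2O = FALSE)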

Erin LeDell
    Or (if you really need a 20,000 class output) give up on GBM and switch to deep learning: I think in this case it will be both quicker and better. (But, I would first spend a lot of effort on seeing if those 20,000 can be reduced, using domain knowledge, e.g. to nearer 50 or 100.) – Darren Cook Jun 15 '17 at 08:05
  • I agree that it's never a good idea to train a classifier with 20,000. I had even already reduced it from 50,000 levels. :-o I'm using clusters as well. I'll try to change the way data is organised before I dive into the deep learning business. Thanks all for your help. – Ankhnesmerira Jun 15 '17 at 08:35
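For reference, a minimal sketch of the deep-learning alternative from the comments; the hidden-layer sizes and epoch count are illustrative placeholders, not tuned recommendations:

# h2o.deeplearning treats a categorical response as multinomial classification
dl1 <- h2o.deeplearning(
    y = response_var
  , x = independ_vars
  , training_frame = train.h20
  , hidden = c(200, 200)  # placeholder architecture
  , epochs = 10           # placeholder epoch count
)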