
23Jun2017: Yet another update below.
11Apr2017: I added another update below.
I added an update below.

We developed a model using a gradient boosting machine (GBM). The model was originally built with H2O v3.6.0.8 via R v3.2.3 on a Linux machine:

$ uname -a
Linux xrdcldapprra01.unix.medcity.net 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

The following code has been working fine for months:

modelname <- 'gbm_34325f.hex'
h2o.gbm(x = predictors, y = "outcome", training_frame = modified.hex,
        validation_frame = modified_holdout.hex, distribution = "bernoulli",
        ntrees = 6000, learn_rate = 0.01, max_depth = 5,
        min_rows = 40, model_id = modelname)
gbm <- h2o.getModel(modelname)
h2o.saveModel(gbm, path = '.', force = TRUE)

Last week we upgraded the Linux machine to:

  • R: v 3.3.2
  • H2O: v 3.10.4.2

As shown here in the output from h2o.init():

> h2o.init()
 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         2 days 1 hours 
    H2O cluster version:        3.10.4.2 
    H2O cluster version age:    14 days, 22 hours and 48 minutes  
    H2O cluster name:           H2O_started_from_R_bac_ytl642 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   18.18 GB 
    H2O cluster total cores:    64 
    H2O cluster allowed cores:  64 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    R Version:                  R version 3.3.2 (2016-10-31) 

I am now rebuilding this model from scratch in the newer version of R and H2O. When I run the above R/H2O code, it hangs on this command:

h2o.saveModel(gbm, path = '.', force = TRUE)

While the original session was hung at h2o.saveModel, I started a second R/H2O session and connected to the same (hung) cluster. I can successfully get the model. I can successfully run h2o.saveModelDetails and save the model as JSON, and I can save it as a MOJO. However, I cannot save it as a native 'hex' model via h2o.saveModel.

These are my commands and output from my connected session (while the original session remains hung up):

> h2o.init()
 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         2 days 1 hours 
    H2O cluster version:        3.10.4.2 
    H2O cluster version age:    14 days, 22 hours and 48 minutes  
    H2O cluster name:           H2O_started_from_R_bac_ytl642 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   18.18 GB 
    H2O cluster total cores:    64 
    H2O cluster allowed cores:  64 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    R Version:                  R version 3.3.2 (2016-10-31) 

> modelname <- 'gbm_34325f.hex'
> gbm <- h2o.getModel(modelname)
> gbm
Model Details:
==============

H2OBinomialModel: gbm
Model ID:  gbm_34325f.hex 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1            6000                     6000           839613730         5
  max_depth mean_depth min_leaves max_leaves mean_leaves
1         5    5.00000          6         32    17.51517
[ snip ]

> model_path <- h2o.saveModelDetails( object=gbm, path='.', force=TRUE )
> model_path
[1] "/home/bac/gbm_34325f.hex.json"

# file created:
# -rw-rw-r-- 1 bac bac      552K Apr  2 12:20 gbm_34325f.hex.json
#
# first few characters are:
# {"__meta":{"schema_version":3,"schema_name":"GBMModelV3","schema_type":"GBMModel"},

> h2o.saveMojo( gbm, path='.', force=TRUE )
[1] "/home/bac/gbm_34325f.hex.zip"

# file created:
# -rw-rw-r-- 1 bac bac   7120899 Apr  2 11:57 gbm_34325f.hex.zip
#
# when I unzip this file, things look okay (although MOJOs are new to me).

> h2o.saveModel( gbm, path='.', force=TRUE )
[ this hangs and never returns; i have to kill the entire R session ]

# empty file created:
# -rw-rw-r-- 1 bac bac         0 Apr  2 12:00 gbm_34325f.hex
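As an aside, since MOJOs are new to me: a saved MOJO can be scored entirely outside the H2O cluster with the PredictCsv tool in h2o-genmodel.jar. This is only a sketch; the jar location and the input CSV name are assumptions on my part (the jar ships with the H2O distribution):

```r
# Score the saved MOJO outside the H2O cluster using the PredictCsv tool
# from h2o-genmodel.jar. The jar path and 'holdout.csv' are assumptions;
# the jar ships with the H2O distribution.
system2("java", c("-cp", "h2o-genmodel.jar",
                  "hex.genmodel.tools.PredictCsv",
                  "--header",
                  "--mojo", "gbm_34325f.hex.zip",
                  "--input", "holdout.csv",
                  "--output", "predictions.csv"))
```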

I then accessed this hung-up process via the H2O Flow web interface. Again, I can load and view the model, but when I try to export it, an empty .hex file is created and I see the message:

Waiting for 2 responses...

(2 responses because I exported twice.)

  1. Snapshot of Export via H2OFlow
  2. Snapshot of 'Waiting for 2 responses' message from exportModel

To be clear, I am not loading an old model. Rather, I am rebuilding the model from scratch in the new R/H2O environment. I am, however, using the same R/H2O code that was successful in the older environment.

Any ideas of what is going on? Thanks.


UPDATE:

The problem I have -- h2o.saveModel hangs -- is related to OOM (out of memory).

I see these messages in the .out file created when I h2o.init:

Note:  In case of errors look at the following log files:
    /tmp/RtmpOnJn83/h2o_bfo7328_started_from_r.out
    /tmp/RtmpOnJn83/h2o_bfo7328_started_from_r.err

$ tail -n 6 h2o_bfo7328_started_from_r.out
[ I removed the timestamp / IP info to make this readable ]

FJ-1-107  INFO:  2017-04-04 01:27:04 30 min 56.196 sec            6000       0.25485          0.22119      0.96950       3.54582                       0.08634
2946-780 INFO: GET /3/Models/gbm_34325f.hex, parms: {}
2946-780 INFO: GET /3/Models/gbm_34325f.hex, parms: {}
946-1102 INFO: GET /99/Models.bin/gbm_34325f.hex, parms: {dir=/opt/app/STUFF/bpci/training/facility_models/gbm_34325f.hex, force=TRUE}
946-1102 WARN: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:3.15 GB + POJO:Zero   + FREE:441.54 GB == MEM_MAX:444.44 GB), desiredKV=299.74 GB OOM!
946-1102 WARN: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:3.15 GB + POJO:Zero   + FREE:441.54 GB == MEM_MAX:444.44 GB), desiredKV=299.74 GB OOM!

Once I realized this was an OOM issue, I changed my h2o.init to include max_mem_size:

localH2O = h2o.init(ip = "localhost", port = 54321, nthreads = -1, max_mem_size = '500G')

Even with max_mem_size = '500G' set this high, I still get an OOM error (see above).

When I was running H2O v3.6.0.8, I didn't explicitly define max_mem_size.
I am curious: Now that I've upgraded to H2O v3.10.4.2, is there a larger memory demand? What was the default max_mem_size in H2O v3.6.0.8?

Any idea of what changed memory-wise between the two versions of H2O? And how I can get this to run again?

Thanks!


11Apr2017 UPDATE:

I hoped to share the dataset that generates this error, but unfortunately it contains protected information, so I cannot. I created a 'scrubbed' version of the file -- containing nonsense data -- but found it much too difficult to run the scrubbed data through our model-training R code because of various dependencies and validation checks.

I have a general sense of which parameters cause the OOM (out of memory) error during h2o.saveModel.
Causes errors:

  • 51380 records with 1413 columns of data used to train
  • ntrees = 6000

Does not cause errors:

  • 51380 records with 1413 columns of data used to train
  • ntrees = 3750 (but ntrees = 4000 causes an error)

Does not cause errors:

  • 25000 records with 1413 columns of data used to train (but 40000 records causes an error)
  • ntrees = 6000

There is some combination of number of records, number of columns, and ntrees that eventually causes OOM.

Setting max_mem_size does not help at all. I set it to '100G', '200G', and '300G' and still hit OOM during h2o.saveModel.
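One thing I have not confirmed, but that is worth ruling out (an assumption on my part): h2o.init() silently connects to an already-running cluster and ignores max_mem_size in that case, so the setting only takes effect on a freshly launched JVM. A sketch of forcing a clean restart:

```r
library(h2o)

# max_mem_size only applies when h2o.init() launches a NEW JVM; if a cluster
# is already running, h2o.init() just connects to it and the setting is ignored.
tryCatch(h2o.shutdown(prompt = FALSE), error = function(e) NULL)
Sys.sleep(5)  # give the old JVM a moment to exit

h2o.init(ip = "localhost", port = 54321, nthreads = -1, max_mem_size = "300G")

# 'H2O cluster total memory' in the printed summary should now reflect the limit.
h2o.clusterInfo()
```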

Testing earlier versions of H2O

Because I cannot compromise on the number of records and columns used to train, or on the number of trees needed in the GBM, I had to go back to an earlier version of H2O.

After working with ten different versions of h2o, I found the most recent released version that does not produce OOM. The versions and the results are:

  1. v3.6.0.8 - success (original version used to create model)
  2. v3.8.1.4 - success
  3. v3.10.0.8 - success
  4. v3.10.2.1 - success
  5. v3.10.3.1 - error: OOM
  6. v3.10.3.2 - error: OOM
  7. v3.10.3.5 - error: OOM
  8. v3.10.4.2 - error: OOM (upgraded to this; found OOM error)
  9. v3.10.4.3 - error: OOM
  10. v3.11.0.3839 - success

I am not using v3.11.0.3839 since it seems to be 'bleeding edge'. I am currently running with v3.10.2.1.

I hope this helps someone track down this bug.


23Jun2017 UPDATE:

I was able to fix this problem by:

  1. upgrading to v3.10.5.1
  2. setting both min_mem_size and max_mem_size during h2o.init()

See: https://stackoverflow.com/a/44724813/7733787
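For completeness, the working init call looks something like this (the sizes below are placeholders, not the actual values from my machine):

```r
library(h2o)

# Fixed under H2O v3.10.5.1 by setting BOTH min_mem_size and max_mem_size.
# The sizes here are placeholders -- choose values that fit your machine.
h2o.init(ip = "localhost", port = 54321, nthreads = -1,
         min_mem_size = "100G", max_mem_size = "100G")
```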

BA88
  • have you tried rolling back to previous version of R, of H2O, using old data ? – c69 Apr 02 '17 at 23:12
  • Any chance you can provide a reproducible example? I've tried to reproduce on another dataset and I don't have the `h2o.saveModel` errors: https://gist.github.com/ledell/5223980f9cfe3cf170648c3ff2748486 I'm assuming you have the same amount of memory available to H2O now as you did back when you were using 3.6? – Erin LeDell Apr 03 '17 at 05:14
  • @c69, Unfortunately I cannot roll back the versions of R and H2O on the Linux machine. Our sysadmin did not maintain both previous and new versions. I have greater flexibility running on my mac so I'll play around there. – BA88 Apr 03 '17 at 12:19
  • 1
    @Erin LeDell, yes, I have the same amount of memory available. I'll try to create a reproducible example as I try to debug this. Also, thanks for posting an encapsulated example. I'll see if your example works in my environment. Thanks, all! – BA88 Apr 03 '17 at 12:21
  • I updated my original post. I found an OOM (out of memory) error. I set `max_mem_size = '500G'` and still I get OOM error. Any ideas how to work around this? – BA88 Apr 04 '17 at 14:46
  • I added another update to my original post. I am unable to figure out where the OOM error comes from. Therefore I had to downgrade the version of H2O I'm using. Thanks. – BA88 Apr 12 '17 at 03:08

1 Answer


Since this problem is directly related to memory, let's set the memory for your H2O instance properly and make sure the setting is working. Setting max_mem_size to arbitrary values (100g, 200g, 300g) is not going to help. First we need to know the total RAM in your machine; then you can give about 80% of that memory to your H2O instance.

For example, I have 16 GB in my machine and want to give 12 GB to the H2O instance. When starting it from R, I do the following:

h2o.init(max_mem_size = "12g")

Once H2O is up and running, the output confirms the memory set for the H2O process:

R is connected to the H2O cluster: 
H2O cluster uptime:         2 seconds 166 milliseconds 
H2O cluster version:        3.10.4.3 
H2O cluster version age:    12 days  
H2O cluster name:           H2O_started_from_R_avkashchauhan_kuc791 
H2O cluster total nodes:    1 
H2O cluster total memory:   10.67 GB <=== [memory setting working]
H2O cluster total cores:    8 
H2O cluster allowed cores:  2 
H2O cluster healthy:        TRUE 
H2O Connection ip:          localhost 
H2O Connection port:        54321 
H2O Connection proxy:       NA 
H2O Internal Security:      FALSE 
R Version:                  R version 3.3.2 (2016-10-31) 

If you change your dataset size across model-building steps, you will see OOM at seemingly random row counts, because sometimes the Java GC has already cleared unused memory and sometimes it is still waiting to. So you may hit OOM once with N rows yet not hit OOM with 2N rows in the same Java instance. Chasing that route is not useful.

This is definitely a memory-related issue; make sure you give enough memory to the H2O cluster and then see how it works.
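You can also re-check the allotment at any time from R (h2o.clusterInfo() simply re-prints the connection summary shown above):

```r
library(h2o)

# Re-print the cluster summary; check the 'H2O cluster total memory' line.
# Note it reports usable heap, which is somewhat less than the -Xmx requested
# (e.g. 10.67 GB shown for a 12g request above).
h2o.clusterInfo()
```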

AvkashChauhan
  • thank you for your reply. I agree, "chasing that route [changing `max_mem_size`] is not useful". Our Linux machine has 1.5 TB of memory. At one point I allocated `max_mem_size = 500G` and I still got an OOM error. Nothing has changed on our Linux machine except upgrades to R (from v3.2.3 to v3.3.2) and H2O (from v3.6.0.8 to v3.10.4.2). – BA88 Apr 13 '17 at 13:34
  • The R code has not changed. The calls to H2O had not changed. The GBM model training successfully completed (ie, `h2o.saveModel` was written to disk) with the old version of R/H2O. The GBM model training throws an OOM at `h2o.saveModel` with the new environment. After working through 10 different versions of H2O (always using the new version of R v3.3.2), I found that all H2O versions 3.10.3.x and 3.10.4.x throw an OOM error. I found that H2O v3.10.2.1 works; I am now using this version. H2O version 3.11.0.3839 also works but I am not using this version as it seems to be bleeding edge. – BA88 Apr 13 '17 at 13:35
  • I hope this OOM error can be fixed otherwise we are stuck with v3.10.2.1 or will have to migrate to, say, python's scikit-learn. – BA88 Apr 13 '17 at 13:36
  • Please send an email to support@h2o.ai so we can troubleshoot your problem on your machine, I will help you there. – AvkashChauhan Apr 13 '17 at 14:01
  • thank you for offering to help. I just sent an email to support@h2o.ai. Thanks! – BA88 Apr 17 '17 at 16:03