23Jun2017: Yet another update...
11Apr2017: I added another update below...
I added an update below...
We developed a model using a gradient boosting machine (GBM). It was originally built with H2O v3.6.0.8 via R v3.2.3 on a Linux machine:
$ uname -a
Linux xrdcldapprra01.unix.medcity.net 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
The following code has been working fine for months:
modelname <- 'gbm_34325f.hex'
h2o.gbm(x = predictors, y = "outcome", training_frame = modified.hex,
validation_frame = modified_holdout.hex, distribution="bernoulli",
ntrees = 6000, learn_rate = 0.01, max_depth = 5,
min_rows = 40, model_id = modelname)
gbm <- h2o.getModel(modelname)
h2o.saveModel( gbm, path='.', force = TRUE )
Last week we upgraded the Linux machine to:
- R: v 3.3.2
- H2O: v 3.10.4.2
as shown in this output from h2o.init():
> h2o.init()
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 2 days 1 hours
H2O cluster version: 3.10.4.2
H2O cluster version age: 14 days, 22 hours and 48 minutes
H2O cluster name: H2O_started_from_R_bac_ytl642
H2O cluster total nodes: 1
H2O cluster total memory: 18.18 GB
H2O cluster total cores: 64
H2O cluster allowed cores: 64
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
R Version: R version 3.3.2 (2016-10-31)
I am now rebuilding this model from scratch in the newer version of R and H2O. When I run the above R/H2O code, it hangs on this command:
h2o.saveModel( gbm, path='.', force = TRUE )
While my program was hung at h2o.saveModel, I started another R/H2O session and connected to the hung process. I can successfully get the model. I can successfully run h2o.saveModelDetails and save the model as JSON, and I can save it as a MOJO. However, I cannot save it as a native 'hex' model via h2o.saveModel.
These are my commands and output from my connected session (while the original session remains hung up):
> h2o.init()
Connection successful!
R is connected to the H2O cluster:
[ snip -- same cluster info as shown above ]
> modelname <- 'gbm_34325f.hex'
> gbm <- h2o.getModel(modelname)
> gbm
Model Details:
==============
H2OBinomialModel: gbm
Model ID: gbm_34325f.hex
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1 6000 6000 839613730 5
max_depth mean_depth min_leaves max_leaves mean_leaves
1 5 5.00000 6 32 17.51517
[ snip ]
> model_path <- h2o.saveModelDetails( object=gbm, path='.', force=TRUE )
> model_path
[1] "/home/bac/gbm_34325f.hex.json"
# file created:
# -rw-rw-r-- 1 bac bac 552K Apr 2 12:20 gbm_34325f.hex.json
#
# first few characters are:
# {"__meta":{"schema_version":3,"schema_name":"GBMModelV3","schema_type":"GBMModel"},
> h2o.saveMojo( gbm, path='.', force=TRUE )
[1] "/home/bac/gbm_34325f.hex.zip"
# file created:
# -rw-rw-r-- 1 bac bac 7120899 Apr 2 11:57 gbm_34325f.hex.zip
#
# when I unzip this file, things look okay (although MOJOs are new to me).
> h2o.saveModel( gbm, path='.', force=TRUE )
[ this hangs and never returns; i have to kill the entire R session ]
# empty file created:
# -rw-rw-r-- 1 bac bac 0 Apr 2 12:00 gbm_34325f.hex
I then accessed this hung process via the H2O Flow web interface. Again, I can load and view the model. When I try to export the model, an empty .hex file is created and I see the message:
Waiting for 2 responses...
(2 responses because I exported twice.)
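As a possible stop-gap (a sketch, not part of my original workflow): since the MOJO export succeeds, the model could at least be used for scoring outside the cluster with the h2o-genmodel PredictCsv tool. The jar location and the input/output file names below are assumptions, not paths from my setup:

```
# Assumes h2o-genmodel.jar is available (it ships with the H2O download)
# and that new_data.csv has the same columns used in training.
java -cp h2o-genmodel.jar hex.genmodel.tools.PredictCsv \
     --mojo gbm_34325f.hex.zip \
     --input new_data.csv \
     --output predictions.csv
```

This does not solve the saveModel hang, but it would let scoring continue while the binary-save issue is investigated.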
To be clear, I am not loading an old model. Rather, I am rebuilding the model from scratch in the new R/H2O environment. I am, however, using the same R/H2O code that was successful in the older environment.
Any ideas of what is going on? Thanks.
UPDATE:
The problem I have -- h2o.saveModel hangs -- is related to OOM (out of memory). I see these messages in the .out file created when I run h2o.init():
Note: In case of errors look at the following log files:
/tmp/RtmpOnJn83/h2o_bfo7328_started_from_r.out
/tmp/RtmpOnJn83/h2o_bfo7328_started_from_r.err
$ tail -n 6 h2o_bfo7328_started_from_r.out
[ I removed the timestamp / IP info to help make this readable ]
FJ-1-107 INFO: 2017-04-04 01:27:04 30 min 56.196 sec 6000 0.25485 0.22119 0.96950 3.54582 0.08634
2946-780 INFO: GET /3/Models/gbm_34325f.hex, parms: {}
2946-780 INFO: GET /3/Models/gbm_34325f.hex, parms: {}
946-1102 INFO: GET /99/Models.bin/gbm_34325f.hex, parms: {dir=/opt/app/STUFF/bpci/training/facility_models/gbm_34325f.hex, force=TRUE}
946-1102 WARN: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:3.15 GB + POJO:Zero + FREE:441.54 GB == MEM_MAX:444.44 GB), desiredKV=299.74 GB OOM!
946-1102 WARN: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:3.15 GB + POJO:Zero + FREE:441.54 GB == MEM_MAX:444.44 GB), desiredKV=299.74 GB OOM!
Once I realized this was an OOM issue, I changed my h2o.init() call to include max_mem_size:
localH2O = h2o.init(ip = "localhost", port = 54321, nthreads = -1, max_mem_size = '500G')
Even with max_mem_size = '500G' set this high, I still get an OOM error (see above). When I was running H2O v3.6.0.8, I didn't explicitly define max_mem_size.
I am curious: Now that I've upgraded to H2O v3.10.4.2, is there a larger memory demand? What was the default max_mem_size in H2O v3.6.0.8?
Any idea of what changed memory-wise between the two versions of H2O? And how I can get this to run again?
Thanks!
11Apr2017 UPDATE:
I hoped to share the dataset that generates this error. Unfortunately, the data contains protected information so I cannot share it. I created a 'scrubbed' version of this file -- contains nonsense data -- but I found it much too difficult to run this scrubbed data through our model training R code because of various dependencies and validation checks.
I have a general sense of which combinations of parameters trigger the OOM (out of memory) error during h2o.saveModel.
Causes errors:
- 51380 records with 1413 columns of data used to train
- ntrees = 6000
Does not cause errors:
- 51380 records with 1413 columns of data used to train
- ntrees = 3750 (but ntrees = 4000 causes an error)
Does not cause errors:
- 25000 records with 1413 columns of data used to train (but 40000 records causes an error)
- ntrees = 6000
There is some combination of number of records, number of columns, and ntrees that eventually causes OOM.
Setting max_mem_size does not help at all. I set it to '100G', '200G', and '300G' and still hit OOM during h2o.saveModel.
Testing earlier versions of H2O
Because I cannot compromise on number of records and number of columns used to train and on the number of trees needed in the GBM, I had to go back to an earlier version of h2o.
After working with ten different versions of h2o, I found the most recent released version that does not produce OOM. The versions and the results are:
- v3.6.0.8 - success (original version used to create model)
- v3.8.1.4 - success
- v3.10.0.8 - success
- v3.10.2.1 - success
- v3.10.3.1 - error: OOM
- v3.10.3.2 - error: OOM
- v3.10.3.5 - error: OOM
- v3.10.4.2 - error: OOM (upgraded to this; found OOM error)
- v3.10.4.3 - error: OOM
- v3.11.0.3839 - success
I am not using v3.11.0.3839 since it seems to be 'bleeding edge'. I am currently running with v3.10.2.1.
I hope this helps someone track down this bug.
23Jun2017 UPDATE:
I was able to fix this problem by:
- upgrading to v3.10.5.1
- setting both min_mem_size and max_mem_size during h2o.init()
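For completeness, a minimal sketch of the fixed h2o.init() call. The sizes below are illustrative placeholders for our large-memory box, not the exact values I used and not recommendations:

```r
library(h2o)
# Pin both the lower and upper bounds of the JVM heap;
# the values here are example placeholders, not recommendations.
localH2O <- h2o.init(ip = "localhost", port = 54321, nthreads = -1,
                     min_mem_size = '100G', max_mem_size = '200G')
```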