
At first I thought it was a random issue, but it happens again when I re-run the script.

Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = urlSuffix,  : 
Unexpected CURL error: Recv failure: Connection reset by peer

I'm doing a grid search with a Gradient Boosting Machine on a medium-sized dataset (roughly 40000 x 30). The largest `ntrees` value in the grid is 1000. The error usually appears after the script has been running for a couple of hours. I set `max_mem_size` to 30 GB.
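
For context, I start the cluster roughly like this (only the `max_mem_size` value is exact; the rest is incidental):

library(h2o)
h2o.init(max_mem_size = "30G")   # 30 GB heap cap for the local H2O cluster

and the grid-search loop is:
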

for (k in 1:nrow(par.grid)) {
    hg = h2o.gbm(training_frame = Xtr.hf,
                 validation_frame = Xt.hf,
                 distribution = "huber",
                 huber_alpha = HuberAlpha,
                 x = 2:ncol(Xtr.hf),       # predictor columns
                 y = 1,                    # response column
                 ntrees = par.grid[k, "ntree"],
                 max_depth = depth,
                 learn_rate = par.grid[k, "shrink"],
                 min_rows = par.grid[k, "min_leaf"],
                 sample_rate = samp_rate,
                 col_sample_rate = c_samp_rate,
                 nfolds = 5,
                 model_id = p(iname, "_gbm_CV"))
    cv_result[k, 1] = h2o.mse(hg, train = TRUE)
    cv_result[k, 2] = h2o.mse(hg, valid = TRUE)
}
horaceT
  • Have you tried giving H2O more memory? It might be running out of memory on the H2O cluster. I can't tell how many models you are trying to train (technically there will be `(5+1)*nrow(par.grid)` total models because you have `nfolds = 5`), but GBM models can be big and eat up your RAM... – Erin LeDell Jul 28 '17 at 00:24
  • @ErinLeDell I can confirm it's RAM. This is actually an inner loop in another loop, so the memory demand is even bigger. Question for you is why does it keep all the (5+1)*N models? Once a run is finished, the previous model should be overwritten, right? – horaceT Jul 28 '17 at 15:43

1 Answer


Try adding `gc()` in your innermost loop. Even better would be to explicitly use `h2o.rm()`.

So, it would become something like:

for (k in 1:nrow(par.grid)) {
  hg = h2o.gbm(...stuff...,
               model_id = p(iname, "_gbm_CV"))
  cv_result[k, 1] = h2o.mse(hg, train = TRUE)
  cv_result[k, 2] = h2o.mse(hg, valid = TRUE)
  h2o.rm(hg); rm(hg); gc()   # remove the model from the H2O cluster, then drop the R handle
}

Theoretically this shouldn't matter, but if R holds on to the reference, then H2O will too.

If you think you might want to investigate any models further, and you have plenty of local disk space, you could call `h2o.saveModel()` before your `h2o.mse()` calls. (You'll need to specify a filename that somehow summarizes all your parameters, of course...)
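
For example, a sketch of that saving step, assuming the same `hg`, `par.grid`, and `k` as in your loop (the directory layout and the sprintf naming pattern here are illustrative, not from your code):

# Save each model under a directory that encodes its grid point
model_dir <- file.path("h2o_models",
                       sprintf("ntree%s_shrink%s_minleaf%s",
                               par.grid[k, "ntree"],
                               par.grid[k, "shrink"],
                               par.grid[k, "min_leaf"]))
h2o.saveModel(hg, path = model_dir, force = TRUE)  # the file name inside is the model_id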

UPDATE based on comment: If you do not need to keep any models or data, then using `h2o.removeAll()` is another way to rapidly reclaim all the memory. (This approach is also worth considering if any data or models you do need preserved are quick and easy to re-load.)
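
A sketch of that outer-loop pattern (the outer loop, variable names, and file paths are assumptions for illustration; the key call is `h2o.removeAll()`):

for (j in 1:n_outer) {
  # ... inner grid-search loop over par.grid, as in the question ...

  h2o.removeAll()   # drop every frame and model from the H2O cluster

  # re-import whatever the next outer iteration needs
  Xtr.hf <- h2o.importFile("train.csv")
  Xt.hf  <- h2o.importFile("valid.csv")
}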

Darren Cook
  • I think the memory issue is with the Java server, which keeps as many copies of the dataset as there are CV iterations, so `gc()` won't help. What I ended up doing was calling `h2o.removeAll()` in the outer loop. – horaceT Jul 29 '17 at 17:48
  • @horaceT Yes, it is h2o (the java server) holding the memory. But it won't release it while it thinks a client is using it, apparently even when re-using a model id, which is why you do the explicit `gc()` on the R client. `h2o.removeAll()` is even better when you have that option; I will edit my answer to mention that. – Darren Cook Jul 30 '17 at 07:50
  • My experience has been that `gc()` doesn't do much to help R's memory issues. It's a pity that while R has gained so much popularity in recent years, its core weaknesses have never been addressed. Or maybe it's just me grumbling.... – horaceT Jul 30 '17 at 19:41