
I'm running out of memory when I try to fit a random forest model on my dataset (5888 bytes) using rsparkling's random forest function:

 h2o.randomForest(x = x,
                  y = y,
                  training_frame = trainDatasetTopTen_tbl,
                  nfolds = 5)

My configuration settings:

config <- spark_config()
config$spark.driver.cores <- 3 
config$spark.driver.memory <- "3.4G" 
config$spark.driver.extraJavaOptions <- "-XX:MaxPermSize=3.8G"

sc <- spark_connect(master = 'local', config = config,
                version = '2.1.0')
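For reference, in local mode the driver JVM is launched by spark-submit before `spark.driver.memory` from `spark_config()` can take effect, so sparklyr's `sparklyr.shell.*` options (which are forwarded to spark-submit's command-line flags, such as `--driver-memory`) are the documented way to size the driver heap. A sketch of that variant (the "3g" value is an illustrative assumption, not a confirmed fix):

```r
library(sparklyr)

config <- spark_config()
# In local mode the driver JVM also hosts the H2O node, so its heap
# bounds the "H2O cluster total memory" reported by the cluster info.
# sparklyr.shell.* options are passed to spark-submit at launch time,
# before the JVM starts, unlike spark.driver.memory set afterwards.
config$`sparklyr.shell.driver-memory` <- "3g"

sc <- spark_connect(master = "local", config = config,
                    version = "2.1.0")
```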

The memory available in my machine is 4 GB.

H2O cluster info is:

R is connected to the H2O cluster: 
H2O cluster uptime:         30 minutes 376 milliseconds 
H2O cluster version:        3.10.5.2 
H2O cluster version age:    24 days  
H2O cluster name:           sparkling-water-mubarak_local-1499963226139 
H2O cluster total nodes:    1 
H2O cluster total memory:   0.7 GB 
H2O cluster total cores:    4 
H2O cluster allowed cores:  4 
H2O cluster healthy:        TRUE 
H2O Connection ip:          127.0.0.1 
H2O Connection port:        54321 
H2O Connection proxy:       NA 
H2O Internal Security:      FALSE 
R Version:                  R version 3.4.0 (2017-04-21) 
  • The H2O startup log for Java (under http://localhost:4040/sparkling-water/) shows:

    thread INFO: Java heap totalMemory: 461.0 MB Java heap maxMemory: 910.5 MB
    thread INFO: Java version: Java 1.8.0_65 (from Oracle Corporation)
    thread INFO: JVM launch parameters: [-Xmx1g]

Therefore my question is: how can I increase the JVM heap parameter (-Xmx) from 1 GB to 3 GB?

My devtools information is:

Session info --------------------------------------
setting  value                       
version  R version 3.4.0 (2017-04-21)
system   x86_64, darwin15.6.0        
ui       RStudio (1.0.143)           
language (EN)                        
collate  en_GB.UTF-8                 
tz       Europe/London               
date     2017-07-13

package      * version    date      
base         * 3.4.0      2017-04-21
caret        * 6.0-76     2017-04-18
datasets     * 3.4.0      2017-04-21
dplyr        * 0.7.1      2017-06-22
ggplot2      * 2.2.1      2016-12-30
graphics     * 3.4.0      2017-04-21
grDevices    * 3.4.0      2017-04-21
h2o          * 3.10.5.2   2017-07-01
lattice      * 0.20-35    2017-03-25
methods      * 3.4.0      2017-04-21
rsparkling   * 0.2.1      2017-06-30
sparklyr     * 0.5.6-9011 2017-07-05
stats        * 3.4.0      2017-04-21
utils        * 3.4.0      2017-04-21

Thank you, MJ

  • Hi and welcome to Stack Overflow, please take a time to go through the [welcome tour](https://stackoverflow.com/tour) to know your way around here (and also to earn your first badge), read how to create a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) and also check [How to Ask Good Questions](https://stackoverflow.com/help/how-to-ask) so you increase your chances to get feedback and useful answers. – DarkCygnus Jul 13 '17 at 17:15
  • I don't do Spark or H2O but if I did I'd want to know what "JVM parameter" you meant (my guess would be that it was the `H2O cluster total memory` parameter but its best to be explicit) – tsn Jul 13 '17 at 17:50
  • Can you update your example to show how you started the H2O cluster? The H2O cluster only has 0.7 GB, which is why you're probably running out of memory. – Erin LeDell Jul 13 '17 at 19:16
  • Hi Erin, I'm starting H2o via sparklyr and rsparkling. The h2o cluster starts once i convert spark data frame to h2o data frame. – mike Jul 13 '17 at 21:12

0 Answers