
We are trying to run a CoxPH model using H2O and RSparkling on a large data set (6 GB, 300 columns). Whatever Spark configuration we try, we run into memory issues.

As per the H2O guidance, the cluster only needs roughly 4 times the size of the data, but we have tried even 4 worker nodes with 128 GB each plus a 128 GB master node, and it still raises memory errors.

Please help us choose a Spark configuration that can run H2O on our current data set. The same code works for 50,000 records.

We have 300 columns for x, 2 pairs of interaction terms, and an offset column and a weights column as well.

You can find the sample code below, but it doesn't use the 300 columns. I don't know how to provide the exact input file and full code to reproduce the issue; please let me know if you would prefer to see the actual code with 300 columns.

```r
# Load the libraries used to analyze the data
library(survival)
library(MASS)
library(h2o)

# Start (or connect to) an H2O cluster and import the churn data
# ("path/to/churn.csv" is a placeholder for the actual input file)
h2o.init()
churn_hex <- h2o.importFile("path/to/churn.csv")

# Create H2O-based CoxPH model
predictors <- c("HasPartner", "HasSingleLine", "HasMultipleLines",
                "HasPaperlessBilling", "HasAutomaticBilling", "MonthlyCharges",
                "HasOnlineSecurity", "HasOnlineBackup", "HasDeviceProtection",
                "HasTechSupport", "HasStreamingTV", "HasStreamingMovies")

h2o_model <- h2o.coxph(x = predictors,
                       event_column = "HasChurned",
                       stop_column = "tenure",
                       stratify_by = "Contract",
                       training_frame = churn_hex)

print(summary(h2o_model))
```
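
For reference, the interaction pairs, offset, and weights mentioned above would be passed to `h2o.coxph` roughly as below. This is only a sketch: `InteractionA1`/`InteractionA2`, `InteractionB1`/`InteractionB2`, `ExposureOffset`, and `CaseWeight` are hypothetical placeholder column names, not columns from the real data set.

```r
# Sketch only: column names below are hypothetical placeholders
h2o_model <- h2o.coxph(x = predictors,
                       event_column = "HasChurned",
                       stop_column = "tenure",
                       stratify_by = "Contract",
                       # two pairs of first-order interaction terms
                       interaction_pairs = list(c("InteractionA1", "InteractionA2"),
                                                c("InteractionB1", "InteractionB2")),
                       offset_column = "ExposureOffset",
                       weights_column = "CaseWeight",
                       training_frame = churn_hex)
```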
  • What does `churn_hex` look like, in Flow, before you run `h2o.coxph`? I.e. how much memory is it using, how much cluster memory is showing as free? Because you say you have 70% categorical columns, the actual memory needed might be very different to the 6GB it occupies on disk. – Darren Cook Dec 09 '19 at 15:12
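
A quick sketch of how the same numbers can also be checked from R instead of Flow, assuming an H2O session is already connected and the frame is loaded as `churn_hex`:

```r
# Inspect the parsed frame and the cluster's memory from R
dim(churn_hex)           # rows and columns of the parsed frame
h2o.describe(churn_hex)  # per-column type, cardinality, missing counts
h2o.clusterStatus()      # per-node free memory in the H2O cluster
```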

2 Answers


It all depends on the cardinality of the stop column and the stratification column. I would try just a single node with, say, 32-64 GB of memory.
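
If it helps to check those numbers, here is a minimal sketch, assuming the frame from the question is loaded as `churn_hex` and the stop and stratification columns are `tenure` and `Contract`:

```r
# Cardinality of the stratification column and spread of the stop column
h2o.nlevels(churn_hex$Contract)     # number of distinct strata
nrow(h2o.unique(churn_hex$tenure))  # number of distinct stop-time values
summary(churn_hex$tenure)           # min/max of the stop column
```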

Please share details about the dataset.

Michal Kurka
  • We tried on a single node with 64 GB but it didn't work. – Divya M Dec 02 '19 at 13:27
  • We have also tried to provide enough resources. Below is one of the configurations we tried (reformatted as a code sketch after these comments): conf$spark.executor.memory <- "192g" conf$spark.executor.cores <- 5 conf$spark.executor.instances <- 9 conf$'sparklyr.shell.executor-memory' <- "32g" conf$'sparklyr.shell.driver-memory' <- "32g" conf$spark.yarn.am.memory <- "32g" conf$spark.dynamicAllocation.enabled <- "false" conf$spark.driver.memory="57.6g" sc <- spark_connect(master = "yarn-client", version = "2.4.3", config = conf) – Divya M Dec 02 '19 at 13:53
  • The data set size is 6 GB and we have 300 columns. We have 2,500 distinct values for stratification, and more than 70% of the columns are treated as categorical variables. In the sample file the T_stop column varies between roughly 0.05 and 7. – Divya M Dec 02 '19 at 13:55
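
For readability, here is the configuration from the comment above laid out as a sparklyr sketch. The values, the YARN master, and the Spark version are exactly as given by the asker; the `spark_config()` initialization is added for completeness.

```r
library(sparklyr)

conf <- spark_config()
conf$spark.executor.memory            <- "192g"
conf$spark.executor.cores             <- 5
conf$spark.executor.instances         <- 9
conf$`sparklyr.shell.executor-memory` <- "32g"
conf$`sparklyr.shell.driver-memory`   <- "32g"
conf$spark.yarn.am.memory             <- "32g"
conf$spark.dynamicAllocation.enabled  <- "false"
conf$spark.driver.memory              <- "57.6g"

sc <- spark_connect(master = "yarn-client", version = "2.4.3", config = conf)
```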

I would try to isolate the different phases of the workload, even to the point of doing any data prep in one Spark job and then doing the H2O-3 model training in a new JVM without Spark at all. Then, for whichever phase is causing the OOM, make sure you turn on Java-level GC logging:

-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps

Take the GC logging output and feed it to http://gceasy.io and see what the curve looks like.

That will tell you whether memory use grows gradually or bursts up suddenly.
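
A minimal sketch of the "H2O-3 without Spark" step, assuming the prepared data has been written out (here as the placeholder `prepped.csv`) and a standalone H2O JVM is launched with GC logging enabled; the heap size and file names are illustrative, not a recommendation:

```r
# Launch a standalone H2O JVM with GC logging, e.g. from a shell:
#   java -Xmx64g -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -jar h2o.jar -name coxph-test
# Then attach to it from R and train the model with Spark out of the picture:
library(h2o)
h2o.init(ip = "localhost", port = 54321, startH2O = FALSE)  # connect to the JVM above

churn_hex <- h2o.importFile("prepped.csv")  # placeholder path for the prepared data

# "predictors" as defined in the question's sample code
h2o_model <- h2o.coxph(x = predictors,
                       event_column = "HasChurned",
                       stop_column = "tenure",
                       stratify_by = "Contract",
                       training_frame = churn_hex)
```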

TomKraljevic