We are trying to run a CoxPH model using H2O (RSparkling) on a large dataset, about 6 GB with 300 columns. Whatever Spark configuration we try, we run into memory issues.
According to the H2O sizing guidance, the cluster only needs to be about 4x the size of the data, but even with four 128 GB worker nodes and a 128 GB master node it still fails with memory errors.
Please help us choose the Spark configuration needed to run H2O on this dataset. The same code runs fine on 50,000 records.
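For context, this is roughly how we stand up the cluster (a minimal sketch, assuming rsparkling 3.x with the internal Sparkling Water backend on YARN; the memory values are placeholders for the settings we have been varying):

```r
library(sparklyr)
library(rsparkling)
library(h2o)

# Spark settings we have been varying (values here are placeholders)
config <- spark_config()
config$spark.executor.instances <- 4        # one H2O node per executor
config$spark.executor.memory <- "100g"      # per-executor JVM heap
config$spark.executor.memoryOverhead <- "16g"
config$spark.driver.memory <- "32g"

sc <- spark_connect(master = "yarn", config = config)

# Start H2O inside the Spark executors (internal backend)
hc <- H2OContext.getOrCreate()
```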
We have 300 columns for x, plus 2 pairs of interaction terms, and offset and weights columns as well; a sketch of how we pass those is shown after the sample code below.
You can find the sample code below, but it doesn't have the 300 columns. I am not sure how to provide the exact input file and full code to reproduce the issue; please let me know if you would prefer to see the actual code with all 300 columns.
```r
# Load the libraries used to analyze the data
library(survival)
library(MASS)
library(h2o)

# churn_hex is an H2OFrame assumed to have been loaded earlier (not shown)

# Create H2O-based model
predictors <- c("HasPartner", "HasSingleLine", "HasMultipleLines",
                "HasPaperlessBilling", "HasAutomaticBilling",
                "MonthlyCharges",
                "HasOnlineSecurity", "HasOnlineBackup", "HasDeviceProtection",
                "HasTechSupport", "HasStreamingTV", "HasStreamingMovies")

h2o_model <- h2o.coxph(x = predictors,
                       event_column = "HasChurned",
                       stop_column = "tenure",
                       stratify_by = "Contract",
                       training_frame = churn_hex)
print(summary(h2o_model))
```
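For reference, this is roughly how the real call looks once the interaction pairs, offset, and weights are added (a sketch only: the column names `x1`..`x300`, `wt`, and `off` are placeholders standing in for our actual names):

```r
# Placeholder names standing in for our 300 real predictor columns
predictors_300 <- paste0("x", 1:300)

h2o_model_full <- h2o.coxph(x = predictors_300,
                            event_column = "HasChurned",
                            stop_column = "tenure",
                            stratify_by = "Contract",
                            weights_column = "wt",   # placeholder weights column
                            offset_column = "off",   # placeholder offset column
                            # our 2 pairs of first-order interaction terms
                            interaction_pairs = list(c("x1", "x2"),
                                                     c("x3", "x4")),
                            training_frame = churn_hex)
print(summary(h2o_model_full))
```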