
Problem:

I have multiple datasets of transactional data that I use to predict an event (binary classification as the outcome). One of them has 10,587,989 rows and 23 columns. I am attempting to run gradient boosting with 10-fold CV and ctree (package: party), but every time I run these models my system crashes.

Hardware:

16 cores, 48 gig of RAM, 48 gig of SWAP

Question:

What causes R to crash while working with large datasets, even after utilizing parallel processing, adding more memory, and bouncing the system?

Things I have tried:

  • Enabled parallel processing through doParallel and executed xgboost through caret. I can see every core lighting up, and RAM and swap fully utilized, via `top` on Linux, but it eventually crashes every time.

  • Bounced the RStudio server and rebooted the system as initial maneuvers, but the problem persists.
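For reference, the doParallel + caret setup from the first bullet looks roughly like this (`my_data`, `outcome`, and the tuning values are placeholders, not the actual code that crashed). The comment flags the usual memory hazard with this combination:

```r
library(doParallel)
library(caret)

cl <- makeCluster(detectCores() - 1)  # leave one core for the OS
registerDoParallel(cl)

# Caution: with allowParallel = TRUE and 10-fold CV, each worker gets its
# own copy of the training data, so memory use can be many times the size
# of the dataset -- a common cause of crashes on large data.
ctrl <- trainControl(method = "cv", number = 10, allowParallel = TRUE)

fit <- train(outcome ~ ., data = my_data,
             method = "xgbTree",
             trControl = ctrl,
             tuneGrid = expand.grid(nrounds = 100, max_depth = 6,
                                    eta = 0.3, gamma = 0,
                                    colsample_bytree = 1,
                                    min_child_weight = 1,
                                    subsample = 1))

stopCluster(cl)
```

Reducing the number of workers (or setting `allowParallel = FALSE`) trades speed for a much smaller memory footprint.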

I did find people commenting about H2O. I also reached out to a vendor and asked him for a solution; he suggested sparklyr, but said you need a Hadoop layer on your server to run it.

John Doe

1 Answer


I did find people commenting about H2O. I also reached out to a vendor and asked him for a solution; he suggested sparklyr, but said you need a Hadoop layer on your server to run it.

Your vendor is mistaken; you don't need a Hadoop layer for sparklyr / RSparkling, just Spark.

However, you could also just skip the Spark layer and use H2O directly. That's the best option, and in my experience your hardware is sufficient to train an H2O GBM on 10M rows. Here's an H2O R tutorial that shows how to perform a grid search for GBM. When you start H2O, just make sure to increase the memory from the default 4G:

h2o.init(max_mem_size = "48G")

H2O also supports XGBoost, an alternative GBM implementation, so that's another option.
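As a rough sketch of the whole workflow (the file path, response column name, and `ntrees` value are placeholders): load the data from disk straight into the H2O cluster, then train the GBM there, so R itself never holds the full dataset.

```r
library(h2o)

h2o.init(max_mem_size = "48G")

# Load directly from disk into the H2O cluster -- avoids keeping a second
# copy of the data in R's own memory.
df <- h2o.importFile("/path/to/transactions.csv")
df$outcome <- as.factor(df$outcome)  # binary classification response

# Hold out 20% for validation
splits <- h2o.splitFrame(df, ratios = 0.8, seed = 1)

fit <- h2o.gbm(y = "outcome",
               training_frame = splits[[1]],
               validation_frame = splits[[2]],
               ntrees = 100)

# Validation AUC
h2o.auc(h2o.performance(fit, valid = TRUE))
```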

Erin LeDell
  • I used the H2O package. My R session did not crash (yet). Though I am unclear on what optimizations I can perform to minimize R crashing, your answer helped. I am marking this as answered. – John Doe Apr 06 '18 at 20:14
  • The reason it works in H2O is that it's far more memory-efficient than the GBM implementation from the party package. It was crashing before because it was running out of memory (and now it's not). – Erin LeDell Apr 09 '18 at 18:59
  • I do see some issues though. When I convert an R data frame to an H2O frame with `as.h2o(my_data_frame)`, the number of rows drops from about 7 million to 3.5 million. Investigating why this is happening. I stopped all the other programs so my server works solely on the model-building task. – John Doe Apr 09 '18 at 19:40
  • That's not good. If you can provide a reproducible example, please file a bug report here: https://0xdata.atlassian.net/issues/ If you don't need to do any data munging, load the data directly from disk into the H2O cluster using `h2o.importFile()` and skip `as.h2o()` altogether. It's much more memory efficient. If you have to use `as.h2o()`, then install data.table and set `options(h2o.use.data.table = TRUE)` to speed it up. See: https://stackoverflow.com/questions/49634547/how-to-get-data-into-h2o-fast – Erin LeDell Apr 10 '18 at 22:19