Problem:
I have multiple datasets of transactional data that I use to predict an event (binary classification outcome). One of them has 10,587,989 rows and 23 columns. I am attempting to run gradient boosting with 10-fold CV and ctree (package: party), but every time I run these models my system crashes.
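For context, this is roughly the kind of ctree call that brings the machine down; it is only a sketch, with the data frame txns, the outcome column event, and the depth cap as placeholders, not my exact code:

library(party)

txns$event <- factor(txns$event)            # factor outcome so ctree treats it as classification

fit_ctree <- ctree(
  event ~ .,                                # remaining 22 columns as predictors
  data     = txns,                          # ~10.6M rows x 23 columns
  controls = ctree_control(maxdepth = 10)   # illustrative depth cap, not my real setting
)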
Hardware:
16 cores, 48 GB of RAM, 48 GB of swap
Question:
What causes R to crash while working with large datasets, even after utilizing parallel processing, adding more memory, and bouncing the system?
Things I have tried:
Enabled parallel processing through doParallel and ran xgboost through caret (a sketch of the setup is below); I can see every core lighting up and RAM and swap being fully utilized in top on Linux, but it eventually crashes every time.
Bounced the RStudio Server and rebooted the machine as initial maneuvers, but the problem persists.
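Roughly what the parallel caret/xgboost run looks like (a sketch with placeholder names txns and event, not the actual script):

library(caret)
library(doParallel)

cl <- makePSOCKcluster(16)    # one worker per core; each PSOCK worker gets its own copy of the data
registerDoParallel(cl)

ctrl <- trainControl(method = "cv", number = 10, allowParallel = TRUE)

fit_xgb <- train(
  event ~ .,                  # event is a factor (binary outcome)
  data      = txns,           # ~10.6M rows x 23 columns
  method    = "xgbTree",      # xgboost via caret
  trControl = ctrl
)

stopCluster(cl)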
I did find people commenting about H2O. I also reached out to a vendor and asked him for a solution; he suggested sparklyr, but you need a Hadoop layer on the server to run it.
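For reference, the H2O route I have been reading about (but have not tried) would look roughly like this; the file path, memory cap, and column name event are made-up placeholders:

library(h2o)

h2o.init(nthreads = 16, max_mem_size = "40g")            # H2O runs in its own JVM with its own memory pool

txns_hex <- h2o.importFile("/data/transactions.csv")     # placeholder path; data is parsed into H2O's memory
txns_hex$event <- as.factor(txns_hex$event)              # binary outcome as a factor

fit_gbm <- h2o.gbm(
  y              = "event",
  x              = setdiff(colnames(txns_hex), "event"),
  training_frame = txns_hex,
  nfolds         = 10                                    # 10-fold CV handled inside H2O
)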