
I need to fit GLMs on data that doesn't fit into my computer's memory. Usually, to get around this, I sample the data, fit the model, and then test on a held-out sample kept out of memory. This has been R's major limitation for me, which is why I've preferred SAS for fitting GLMs: it doesn't stumble on data that doesn't fit into memory.

I've been trying to solve this with R on my local machine and want to know whether sparklyr can be used to get around the memory issue. I realise Spark is meant to be used in a cluster environment, but straight up: can sparklyr be used to work with data on my local machine that would otherwise not fit into its memory?

  • If you search for things related to out-of-memory glm and R, you'll come across the `ff` package and the `biglm` package. You could start reading the documentation and look for examples. – Jota Jan 25 '17 at 20:09
  • Thanks for the suggestion. I did have a look at them now. Perhaps I'm not fully clued up on their workings, but it seems `ff` and the family of 'big' R packages are mostly workarounds that don't integrate seamlessly with other R packages (e.g. the tidyverse), so I don't think there is a fix that 'just works'. – Serban Dragne Jan 26 '17 at 08:03
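To make the `biglm` suggestion concrete, below is a minimal sketch of chunked GLM fitting with `biglm::bigglm`, which accepts a data *function* that returns one chunk at a time (returning `NULL` at end of data, and restarting when called with `reset = TRUE`). The file path, chunk size, and column names (`y`, `x1`, `x2`) are hypothetical, and the `bigglm` call is guarded since the package may not be installed:

```r
# Chunked reader over a CSV: returns one data.frame chunk per call,
# NULL at end of file; reset = TRUE restarts from the top.
make_chunk_reader <- function(path, chunk_size = 4) {
  con <- NULL
  header <- names(read.csv(path, nrows = 1))
  function(reset = FALSE) {
    if (reset) {
      if (!is.null(con)) try(close(con), silent = TRUE)
      con <<- file(path, open = "r")
      readLines(con, n = 1)        # skip the header row
      return(NULL)
    }
    chunk <- tryCatch(
      read.csv(con, nrows = chunk_size, header = FALSE, col.names = header),
      error = function(e) NULL     # read.csv errors on an empty read at EOF
    )
    if (is.null(chunk) || nrow(chunk) == 0) NULL else chunk
  }
}

# Tiny demo file standing in for data too big for memory (hypothetical names):
path <- tempfile(fileext = ".csv")
write.csv(data.frame(y = rbinom(10, 1, 0.5), x1 = rnorm(10), x2 = rnorm(10)),
          path, row.names = FALSE)

reader <- make_chunk_reader(path, chunk_size = 4)

# bigglm streams over the chunks; guarded because biglm may not be installed.
if (requireNamespace("biglm", quietly = TRUE)) {
  fit <- biglm::bigglm(y ~ x1 + x2, data = reader, family = binomial())
  print(summary(fit))
}
```

The key point is that only one chunk is ever held in R's memory; `bigglm` updates its sufficient statistics incrementally, which is why it can fit a GLM to data far larger than RAM.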

1 Answer


Spark, and sparklyr, work great at distributing load, but they are not likely to resolve your issue on one box with a single Spark instance. You may have better luck with H2O: https://cran.r-project.org/web/packages/h2o/index.html
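That said, sparklyr does run on a single machine in local mode, and Spark can spill intermediate data to disk, so it may cope with somewhat-larger-than-memory data even without a cluster. A hedged sketch follows; it assumes sparklyr is installed, a local Spark has been set up with `spark_install()`, and the file path and column names (`big_data.csv`, `y`, `x1`, `x2`) are hypothetical, so the Spark-dependent part is guarded:

```r
glm_formula <- y ~ x1 + x2   # hypothetical response and predictors

# Guarded: only runs if sparklyr, a local Spark, and the file are available.
if (requireNamespace("sparklyr", quietly = TRUE) &&
    nrow(sparklyr::spark_installed_versions()) > 0 &&
    file.exists("big_data.csv")) {
  library(sparklyr)
  sc <- spark_connect(master = "local")

  # Read the CSV straight into Spark so it never passes through R's memory;
  # memory = FALSE avoids eagerly caching the table in RAM.
  big_tbl <- spark_read_csv(sc, name = "big_data", path = "big_data.csv",
                            memory = FALSE)

  fit <- ml_generalized_linear_regression(big_tbl, glm_formula,
                                          family = "binomial")
  print(summary(fit))
  spark_disconnect(sc)
}
```

The important design point is that the data is read by Spark directly from disk rather than via `read.csv`, so R itself never materialises the full dataset; the single machine's disk and cores are still the bottleneck, though, which is why this is not a guaranteed fix.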
