
I'm trying to run parallel cv.glmnet Poisson models on a Windows machine with 64 GB of RAM. My data is a 20 million row x 200 column sparse matrix, around 10 GB in size. I'm using makeCluster and doParallel, and setting parallel = TRUE in cv.glmnet (a minimal sketch of the setup is below the two questions). I currently have two issues with this setup:

  1. Distributing the data to the different worker processes takes hours, which cuts into the speedup significantly. I know this can be avoided by forking on Linux machines, but is there any way of reducing this time on Windows?

  2. I'm running this for multiple models with different data and responses, so the object size changes each time. How can I work out in advance how many cores I can run before hitting an 'out of memory' error? I'm particularly confused about how the data gets distributed: if I run on 4 cores, the first rsession uses 30 GB of memory, while the others are closer to 10 GB. What does that 30 GB go towards, and is there any way of reducing it?
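
For reference, a minimal sketch of the setup described above (x, y, and the core count are placeholders, not my actual objects):

    library(Matrix)      # x is a sparse dgCMatrix
    library(glmnet)
    library(doParallel)  # attaches parallel and foreach as well

    # x: ~20M x 200 sparse matrix, y: Poisson response counts (placeholders)
    cl <- makeCluster(4)          # PSOCK cluster on Windows
    registerDoParallel(cl)

    fit <- cv.glmnet(x, y,
                     family   = "poisson",
                     parallel = TRUE)   # CV folds are farmed out to the workers

    stopCluster(cl)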

nolanp2
  • You might want to take a look at the packages {biglasso} and {bigstatsr}, which use matrix data on disk. – F. Privé May 24 '18 at 10:44
  • @F. Privé I'll give it a try, but will accessing data on disk be any faster for each rsession than distributing the data to each of them? – nolanp2 May 24 '18 at 13:25
  • It will use less memory (no copy). – F. Privé May 24 '18 at 13:51
  • I should have mentioned that I'm building Poisson models, which don't seem to be available in biglasso yet, unfortunately. The thing I'm still confused by is the 30 GB of memory being used by my rsession: is there a reason for so much memory being used there, on top of each worker script taking 10 GB? – nolanp2 May 25 '18 at 08:42
  • Two copies of your data and you are already at 30GB. – F. Privé May 25 '18 at 12:17
  • So the session needs 3 copies of the data, and then each thread needs a copy as well? I'm currently using ~50 GB from 3 cores; I would have thought I should be able to do this with 30 GB. – nolanp2 May 27 '18 at 15:51
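
Along the lines of the comment thread above, a rough back-of-the-envelope check, assuming each worker holds one full copy of the data and the master session holds a couple of extra copies while serializing it out (those counts are assumptions, and estimate_cluster_gb is a hypothetical helper, not part of any package):

    # Crude upper bound on memory use for a PSOCK cluster: one copy of x per
    # worker, plus the master's own copy, plus `master_copies` extra copies
    # assumed to exist while the data is serialized out to the workers.
    estimate_cluster_gb <- function(x, n_workers, master_copies = 2) {
      gb <- as.numeric(object.size(x)) / 1024^3
      c(per_worker_gb = gb,
        master_gb     = gb * (1 + master_copies),
        total_gb      = gb * (n_workers + 1 + master_copies))
    }

    # e.g. for the ~10 GB matrix above:
    # estimate_cluster_gb(x, n_workers = 3)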

0 Answers