
This is the first time I am using parallel processing at all, so the question is mainly about my poor syntax.

I would like some help capturing the output of a large number of cv.glmnet iterations, as I believe I have built cv_loop_run very inefficiently. This, together with the 10,000 lambdas, produces a massive matrix that takes up all of my memory and causes a crash. In essence, all I need is the minimum and the 1se lambda from each run (1,000 of them), not all 10,000 lambdas. So instead of capturing a 1,000 x 10,000 list in cv_loop_run, I would get a list that is only 1,000 long.

  registerDoParallel(cl = 8, cores = 4)
  cv_loop_run <- foreach(r = 1:1000,
                         .packages = "glmnet",
                         .combine  = rbind,   # .combine already rbinds the results;
                         .inorder  = FALSE) %dopar% {  # no outer rbind() call is needed

    cv_run <- cv.glmnet(X_predictors, Y_dependent, nfolds = fld,
                        nlambda  = 10000,
                        alpha    = 1,      # for lasso
                        grouped  = FALSE,
                        parallel = TRUE)

    cv_run   # last expression is the value returned to foreach
  }
  l_min <- as.matrix(unlist(as.matrix(cv_loop_run[, 9,  drop = FALSE])))  # column 9 is lambda.min

  l_1se <- as.matrix(unlist(as.matrix(cv_loop_run[, 10, drop = FALSE])))  # column 10 is lambda.1se
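As an aside, indexing by column 9 and 10 relies on the positions of lambda.min and lambda.1se in the list that cv.glmnet returns, and those positions can shift between glmnet versions. A quick way to check them in your installed version (a self-contained sketch; the simulated X and y here are placeholders, not the question's data):

```r
library(glmnet)

set.seed(1)
X <- matrix(rnorm(100 * 20), nrow = 100)  # placeholder predictors
y <- rnorm(100)                           # placeholder response
cv_run <- cv.glmnet(X, y, nfolds = 5)

names(cv_run)                                            # all components of the fit
which(names(cv_run) %in% c("lambda.min", "lambda.1se"))  # their positions in this version
```

Indexing by name (`cv_run$lambda.min`, `cv_run$lambda.1se`) avoids the version dependence entirely.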
  • If you look at the code for `cv.glmnet`, when `parallel = TRUE` there is already a `foreach` `%dopar%` loop inside it. Therefore (I could be wrong, but) I don't believe you will get any increased performance from wrapping the `cv.glmnet` function within another `foreach` `%dopar%` loop. It's like having two nested parallel `foreach` loops, where your inner loop is already using all cores. – jav Sep 05 '16 at 23:16
  • @jav The inner loop, for which I have already set parallel = TRUE, loops through the 10,000 lambdas. The outer %dopar% loop is looping through the 1,000 cv.glmnet runs. – J. Doe. Sep 06 '16 at 08:10
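Following the comment's suggestion, one way to avoid the nested parallelism is to keep only the outer `foreach` parallel and set `parallel = FALSE` inside `cv.glmnet`, while also returning just the two lambdas per run. A minimal self-contained sketch (the simulated X and y, the fold count, and the reduced loop and lambda sizes are placeholders for illustration, not the question's values):

```r
library(doParallel)
library(glmnet)

set.seed(1)
X <- matrix(rnorm(100 * 20), nrow = 100)  # placeholder predictors
y <- rnorm(100)                           # placeholder response

cl <- makeCluster(2)
registerDoParallel(cl)

cv_lambdas <- foreach(r = 1:10,           # 1000 in the real run
                      .packages = "glmnet",
                      .combine  = rbind,
                      .inorder  = FALSE) %dopar% {
  cv_run <- cv.glmnet(X, y, nfolds = 5,
                      nlambda  = 100,     # 10000 in the real run
                      alpha    = 1,       # lasso
                      grouped  = FALSE,
                      parallel = FALSE)   # the outer loop already owns the workers
  # return one small row per run instead of the whole cv.glmnet object
  c(lambda.min = cv_run$lambda.min, lambda.1se = cv_run$lambda.1se)
}
stopCluster(cl)

dim(cv_lambdas)  # one row of (lambda.min, lambda.1se) per run
```

This keeps memory use flat at one 2-column row per run, regardless of `nlambda`.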

1 Answer


OK, so I found it myself. All I have to do is restrict the output of each cv.glmnet run, so that only the minimum and the 1se lambdas are picked up from each run. This means that this:

cv_run <- cv.glmnet(X_predictors, Y_dependent, nfolds = fld,
                    nlambda  = 10000,
                    alpha    = 1,      # for lasso
                    grouped  = FALSE,
                    parallel = TRUE)

becomes this:

cv_run <- cv.glmnet(X_predictors, Y_dependent, nfolds = fld,
                    nlambda  = 10000,
                    alpha    = 1,      # for lasso
                    grouped  = FALSE,
                    parallel = TRUE)[9:10]  # elements 9 and 10 are lambda.min and lambda.1se
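An equivalent, and perhaps slightly more robust, variant is to pull the two values out by name rather than by list position, since positions 9 and 10 are not guaranteed across glmnet versions. A self-contained sketch (the simulated X and y and the reduced sizes are placeholders for illustration):

```r
library(glmnet)

set.seed(1)
X <- matrix(rnorm(100 * 20), nrow = 100)  # placeholder predictors
y <- rnorm(100)                           # placeholder response

cv_run <- cv.glmnet(X, y, nfolds = 5,
                    nlambda = 100,
                    alpha = 1,            # lasso
                    grouped = FALSE)

# keep only the two scalars instead of the full cv.glmnet object
res <- c(lambda.min = cv_run$lambda.min, lambda.1se = cv_run$lambda.1se)
res
```

Returned from the `foreach` body with `.combine = rbind`, this yields a compact 1,000 x 2 matrix rather than 1,000 full cv.glmnet objects.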
J. Doe.