
My training dataset has about 200,000 records and 500 features. (These are sales data from a retail org.) Most of the features are 0/1 and are stored as a sparse matrix.

The goal is to predict the probability of purchase for about 200 products, so I would need to use the same 500 features to predict the probability of purchase for each of the 200 products. Since glmnet is a natural choice for model creation, I thought about fitting the 200 glmnet models in parallel (since all 200 models are independent). But I am stuck using foreach. The code I executed was:

foreach(i = 1:ncol(target)) %dopar% {
  assign(model[i], cv.glmnet(x, target[,i], family = "binomial", alpha = 0,
                             type.measure = "auc", grouped = FALSE,
                             standardize = FALSE, parallel = TRUE))
}

model is a list containing the 200 model names under which I want to store the respective models.

The following code works, but it doesn't exploit the parallel structure and takes about a day to finish!

for(i in 1:ncol(target)) {
  assign(model[i], cv.glmnet(x, target[,i], family = "binomial", alpha = 0,
                             type.measure = "auc", grouped = FALSE,
                             standardize = FALSE, parallel = TRUE))
}

Can someone point me to how to exploit the parallel structure in this case?

Rouse
  • Did you register a parallel backend in the second case? Which one? Do you want to run on a single multicore computer or on a cluster? – Steve Weston Feb 11 '14 at 16:32
  • For the second one - I used the parallel option in glmnet. From what I understand, it uses that to parallelize the CV. I am running on a single multicore computer (quad core with 16 gb ram) – Rouse Feb 11 '14 at 17:58

2 Answers


In order to execute "cv.glmnet" in parallel, you have to specify the parallel=TRUE option, and register a foreach parallel backend. This allows you to choose the parallel backend that works best for your computing environment.

Here's the documentation for the "parallel" argument from the cv.glmnet man page:

parallel: If 'TRUE', use parallel 'foreach' to fit each fold. Must register parallel before hand, such as 'doMC' or others. See the example below.

Here's an example using the doParallel package which works on Windows, Mac OS X, and Linux:

library(glmnet)
library(doParallel)
registerDoParallel(4)
m <- cv.glmnet(x, target[,1], family="binomial", alpha=0, type.measure="auc",
               grouped=FALSE, standardize=FALSE, parallel=TRUE)

This call to cv.glmnet will execute in parallel using four workers. On Linux and Mac OS X, it will execute the tasks using "mclapply", while on Windows it will use "clusterApplyLB".

Nested parallelism gets tricky, and may not help a lot with only 4 workers. I would try using a normal for loop around cv.glmnet (as in your second example) with a parallel backend registered and see what the performance is before adding another level of parallelism.

Also note that the assignment to "model" in your first example isn't going to work when you register a parallel backend. When running in parallel, side-effects generally get thrown away, as with most parallel programming packages.
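As a concrete sketch of that suggestion (assuming x and target from the question are in scope), you can keep the plain for loop, let cv.glmnet parallelize the folds, and collect the fits in a list rather than via assign():

```r
library(glmnet)
library(doParallel)

# Register a backend once; cv.glmnet(parallel = TRUE) will use it
registerDoParallel(4)

# Fit one model per product, collecting results in a list
models <- vector("list", ncol(target))
for (i in seq_len(ncol(target))) {
  models[[i]] <- cv.glmnet(x, target[, i], family = "binomial", alpha = 0,
                           type.measure = "auc", grouped = FALSE,
                           standardize = FALSE, parallel = TRUE)
}
names(models) <- colnames(target)
```

Storing results in a list (or returning them from a foreach loop) is the reliable pattern, since worker-side assignments are discarded.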

Steve Weston
  • (+1) This answer is correct. I've discovered, though, that if my design matrix is too large, R won't take advantage of the additional workers because I don't have enough memory for additional copies of it! – Sycorax Feb 04 '15 at 16:44
  • @user777 You might want to try using workers on multiple computers in order to get access to more aggregate memory. That can be done with either doParallel or doMPI, but is a bit of work unless you have access to a well setup Linux cluster. – Steve Weston Feb 04 '15 at 17:11
  • Of course! The real trick is convincing your boss that you need so many computers... :) I only said that to point out to OP that his desktop-appearing setup may not be sufficient. – Sycorax Feb 04 '15 at 17:17

Stumbled upon this old thread and thought it would be useful to mention that with the future framework, it is possible to do nested and parallel foreach() calls. For instance, assume you have three local machines (with SSH access) and you want to run four cores on each; then you can use:

library("doFuture")
registerDoFuture()
plan(list(
  tweak(cluster, workers = c("machine1", "machine2", "machine3")),
  tweak(multiprocess, workers = 4L)
))


model_fit <- foreach(ii = seq_len(ncol(target))) %dopar% {
  cv.glmnet(x, target[,ii], family = "binomial", alpha = 0,
            type.measure = "auc", grouped = FALSE, standardize = FALSE,
            parallel = TRUE)
}
str(model_fit)

The "outer" foreach-loop will iterate over the targets such that each iteration is processed by a separate machine. Each iteration will in turn process cv.glmnet() using four workers on whatever machine it ends up on.

Of course, if you only have access to a single machine, then it makes little sense to do nested parallel processing. In such cases, you can use:

plan(list(
  sequential,
  tweak(multiprocess, workers = 4L)
))

to parallelize the cv.glmnet() call, or alternatively,

plan(list(
  tweak(multiprocess, workers = 4L),
  sequential
))

or, equivalently, just plan(multiprocess, workers = 4L), to parallelize over targets.
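For that last single-machine setup, a minimal end-to-end sketch (again assuming x and target from the question) would be:

```r
library(glmnet)
library("doFuture")
registerDoFuture()

# Four local R processes; iterations over targets run in parallel
plan(multiprocess, workers = 4L)

model_fit <- foreach(ii = seq_len(ncol(target))) %dopar% {
  cv.glmnet(x, target[, ii], family = "binomial", alpha = 0,
            type.measure = "auc", grouped = FALSE, standardize = FALSE,
            parallel = FALSE)  # outer loop already uses all four workers
}
```

Here parallel = FALSE avoids oversubscribing the machine, since the outer loop already occupies the workers.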

HenrikB
  • I think it would be useful to mention how would one define/assign the machines, and what other prerequisites are necessary in order for everything to work? – runr Oct 29 '21 at 06:20