
I'm running the following code (extracted from the doParallel vignette) on a Linux PC with 4 physical and 8 logical cores.

Running the code with iter=1e+6 or less, everything is fine and I can see from the CPU usage that all cores are employed in the computation. However, with a larger number of iterations (e.g. iter=4e+6), parallel computing does not seem to work: when I monitor the CPU usage, just one core is involved in the computation (at 100% usage).

Example1

require("doParallel")
require("foreach")
registerDoParallel(cores=8)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
iter=4e+6
ptime <- system.time({
    r <- foreach(i=1:iter, .combine=rbind) %dopar% {
        ind <- sample(100, 100, replace=TRUE)
        result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
        coefficients(result1)
    }
})[3]

Do you have any idea what could be the reason? Could memory be the cause?

I googled around and found THIS relevant to my question, but the point is that I'm not getting any kind of error, and that OP seemingly came up with a solution by providing the necessary packages inside the foreach loop. No packages are used inside my loop, as can be seen.
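
(For reference, the kind of fix used in that question looks roughly like this; it's only a sketch, and the package name `MASS` is just a placeholder, since my loop doesn't need any package:)

require("doParallel")
require("foreach")
registerDoParallel(cores=8)

# sketch: make a package available on each worker, either via foreach's
# .packages argument or by calling require() inside the loop body
r <- foreach(i=1:10, .combine=c, .packages="MASS") %dopar% {
    # require("MASS")  # the alternative: load it inside the loop body
    nrow(Boston) + i
}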

UPDATE1

My problem is still not solved. Based on my experiments, I don't think memory is the reason. I have 8GB of memory on the system, on which I run the following simple parallel iteration (over all 8 logical cores):

Example2

require("doParallel")
require("foreach")

registerDoParallel(cores=8)
iter=4e+6
ptime <- system.time({
    r <- foreach(i=1:iter, .combine=rbind) %dopar% {
        i
    }
})[3]

I have no problem running this code, but when I monitor the CPU usage, just one core (out of 8) is at 100%.

UPDATE2

Regarding Example2, @SteveWeston (thanks for pointing this out) stated in the comments: "The example in your update is suffering from having tiny tasks. Only the master has any real work to do, which consists of sending tasks and processing results. That's fundamentally different than the problem with the original example which did use multiple cores on a smaller number of iterations."

However, Example1 remains unsolved. When I run it and monitor the processes with htop, here is what happens in more detail:

Let's name the 8 created processes p1 through p8. The status (column S in htop) for p1 is R, meaning it's running, and it remains unchanged. However, for p2 through p8, after some minutes the status changes to D (i.e. uninterruptible sleep) and, after some more minutes, changes again to Z (i.e. terminated but not reaped by its parent). Do you have any idea why this happens?

  • Have you tried explicitly creating and registering the cluster? For example `cl <- makePSOCKcluster(8); registerDoParallel(cl)`? – Hack-R Jun 10 '16 at 15:22
  • Even with `iter=15e+6`? Could you perhaps comment with your hardware specifications (CPU and memory) and the type of OS on which you ran the code? – 989 Jun 10 '16 at 15:25
  • Yea I just copied and pasted your code and it ran until I killed it. I saw the 8 `Rscript.exe` processes in my Resource Monitor though they were only using a small amount of the CPU core they were running on (but that's not necessarily a problem -- it can fluctuate due to inefficient chunking). I was running it on Windows Server 2008 with 24 physical CPUs. – Hack-R Jun 10 '16 at 15:27
  • Yes, I also tried creating and registering the clusters as you pointed out, to no avail. `registerDoParallel()` is also OK, as it's written in the vignette document provided by the package authors. I have the feeling that it has something to do with memory. – 989 Jun 10 '16 at 15:31
  • Yea, it probably does have to do with memory or compute resources. So do you really *need* that many iterations? If so, can you achieve the same result by chunking it into steps with fewer iterations and then combining (perhaps averaging) the results? I'm not sure what your use case is, but for Machine Learning you can often save model progress to storage memory (i.e. a checkpoint) so that you can break apart training big neural nets and other models into multiple sessions. – Hack-R Jun 10 '16 at 15:34
  • Yes, I do; not as many as `15e+6`, but about `3.8e+6`. I actually wanted to test with a toy example (like the one above) that everything goes fine before running my real experiments. In those experiments I need to repeat the `foreach` loop around `100` times, so I think I'll have to break down the `100` repetitions rather than the `foreach` loop itself. – 989 Jun 10 '16 at 15:39
  • I am getting it to run all 8 (after 10-15 seconds of it getting itself sorted) on Windows with I7 4910 running your above code verbatim. – Bryan Goggin Jun 10 '16 at 15:41
  • Are you confirming that your workers are registered with `getDoParRegistered()` and `getDoParWorkers()`? – Bryan Goggin Jun 10 '16 at 15:43
  • Yes, I do. For `getDoParRegistered()` I get `TRUE` and `8` for `getDoParWorkers()`. – 989 Jun 10 '16 at 15:47
  • The example in your update is suffering from having tiny tasks. Only the master has any real work to do, which consists of sending tasks and processing results. That's fundamentally different than the problem with the original example which did use multiple cores on a smaller number of iterations. – Steve Weston Jun 14 '16 at 16:15
  • @SteveWeston If so, then given that I have 8GB of memory available, and taking the size of the objects created inside and outside of the loop into account, why should the original example not occupy all CPUs at 100% for e.g. `iter=3e+6`? – 989 Jun 14 '16 at 16:42
  • @SteveWeston You may again check my update. – 989 Jun 14 '16 at 19:12
  • I'm starting to think there is a timeout for the client processes. This could be in the client, or even in some network-security mechanism that shuts down atypical ports after a certain time. – EngrStudent Jan 23 '19 at 14:22

2 Answers


I think you're running low on memory. Here's a modified version of that example that should work better when you have many tasks. It uses doSNOW rather than doParallel because doSNOW allows you to process the results with the combine function as they're returned by the workers. This example writes those results to a file in order to use less memory; it then reads the results back into memory at the end using a ".final" function, but you could skip that step if you don't have enough memory.

library(doSNOW)
library(tcltk)
nw <- 4  # number of workers
cl <- makeSOCKcluster(nw)
registerDoSNOW(cl)

x <- iris[which(iris[,5] != 'setosa'), c(1,5)]
niter <- 15e+6
chunksize <- 4000  # may require tuning for your machine
maxcomb <- nw + 1  # this count includes fobj argument
totaltasks <- ceiling(niter / chunksize)

comb <- function(fobj, ...) {
  for(r in list(...))
    writeBin(r, fobj)
  fobj
}

final <- function(fobj) {
  close(fobj)
  t(matrix(readBin('temp.bin', what='double', n=niter*2), nrow=2))
}

mkprogress <- function(total) {
  pb <- tkProgressBar(max=total,
                      label=sprintf('total tasks: %d', total))
  function(n, tag) {
    setTkProgressBar(pb, n,
      label=sprintf('last completed task: %d of %d', tag, total))
  }
}
opts <- list(progress=mkprogress(totaltasks))
resultFile <- file('temp.bin', open='wb')

r <-
  foreach(n=idiv(niter, chunkSize=chunksize), .combine='comb',
          .maxcombine=maxcomb, .init=resultFile, .final=final,
          .inorder=FALSE, .options.snow=opts) %dopar% {
    do.call('c', lapply(seq_len(n), function(i) {
      ind <- sample(100, 100, replace=TRUE)
      result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
      coefficients(result1)
    }))
  }

I included a progress bar since this example takes several hours to execute.

Note that this example also uses the idiv function from the iterators package to increase the amount of work in each of the tasks. This technique is called chunking, and often improves the parallel performance. However, using idiv messes up the task indices, since the variable i is now a per-task index rather than a global index. For a global index, you can write a custom iterator that wraps idiv:

idivix <- function(n, chunkSize) {
  i <- 1
  it <- idiv(n, chunkSize=chunkSize)
  nextEl <- function() {
    m <- nextElem(it)  # may throw 'StopIteration'
    value <- list(i=i, m=m)
    i <<- i + m
    value
  }
  obj <- list(nextElem=nextEl)
  class(obj) <- c('abstractiter', 'iter')
  obj
}

The values emitted by this iterator are lists, each containing a starting index and a count. Here's a simple foreach loop that uses this custom iterator:

r <- 
  foreach(a=idivix(10, chunkSize=3), .combine='c') %dopar% {
    do.call('c', lapply(seq(a$i, length.out=a$m), function(i) {
      i
    }))
  }

Of course, if the tasks are compute intensive enough, you may not need chunking and can use a simple foreach loop as in the original example.

Steve Weston
  • Thanks for your solution but when I run your code I get: `Warning: progress function failed: object 'pb' not found` – 989 Jun 12 '16 at 13:51
  • @m0h3n You can always disable the progress bar by not using the foreach `.options.snow` option. Are `pb` and `progress` both defined in the global environment? – Steve Weston Jun 12 '16 at 14:09
  • Yes, I just copy/pasted your code and ran it. But look at the `comb` function, its parameters, and the `for` loop inside. Is that normal? – 989 Jun 12 '16 at 15:39
  • @m0h3n The `comb` function uses the first argument differently than the subsequent arguments, which requires the use of the foreach `.init` argument. I've included examples of this pattern in some of the foreach-related packages. – Steve Weston Jun 12 '16 at 16:06
  • I'm wondering if you ran your code before posting it here. I get this: `Error: object 'opts' not found`. Please try to clean your environment and run the code you posted here. – 989 Jun 13 '16 at 08:50
  • @m0h3n It works on my Mac laptop and my Linux desktop, and I got a friend to test it to verify that it doesn't just work for me. Since 'opts' is defined after 'mkprogress' and before 'resultFile', the error message that you report makes no sense to me. – Steve Weston Jun 13 '16 at 21:06
  • OK, my bad, sorry. I was running your code via `putty`, and seemingly that was why I kept getting that error. I tried running it on another PC (locally) and it's fine now. How can we get those indices (i.e. `1,2,3,4,...,niter`) stored in `r` as the final result? In my experiments I need to work with these indices; the question stated here is just a toy example. – 989 Jun 14 '16 at 13:37
  • Thanks for your update. You might check my update (original question) as well. – 989 Jun 14 '16 at 16:04

At first I thought you were running into memory problems, because submitting many tasks does use more memory, and that can eventually cause the master process to get bogged down, so my original answer shows several techniques for using less memory. However, it now sounds like there's a startup and a shutdown phase where only the master process is busy, while the workers are busy for some period of time in the middle. I think the issue is that the tasks in this example aren't really very compute intensive, so when you have a lot of tasks, you start to really notice the startup and shutdown times. I timed the actual computations and found that each task only takes about 3 milliseconds. In the past, you wouldn't get any benefit from parallel computing with tasks that small; now, depending on your machine, you can get some benefit, but the overhead is significant, so when you have a great many tasks you really notice that overhead.
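
(If you want to check this on your own machine, here's a rough way to estimate the per-task time; it's not part of the solution, just the kind of measurement I'm referring to:)

x <- iris[which(iris[,5] != "setosa"), c(1,5)]
# run the body of one task 1000 times sequentially and divide the elapsed
# time by 1000 to approximate the per-task cost (roughly the 3 ms mentioned above)
system.time({
    for (i in 1:1000) {
        ind <- sample(100, 100, replace=TRUE)
        result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
        coefficients(result1)
    }
})[3] / 1000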

I still think that my other answer works well for this problem, but since you have enough memory, it's overkill. The most important technique is chunking. Here is an example that uses chunking with minimal changes to the original example:

require("doParallel")
nw <- 8
registerDoParallel(nw)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
niter <- 4e+6
r <- foreach(n=idiv(niter, chunks=nw), .combine='rbind') %dopar% {
  do.call('rbind', lapply(seq_len(n), function(i) {
    ind <- sample(100, 100, replace=TRUE)
    result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
    coefficients(result1)
  }))
}

Note that this does the chunking slightly differently from my other answer: it uses only one task per worker, via the idiv chunks option rather than the chunkSize option. This reduces the amount of work done by the master and is a good strategy if you have enough memory.
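
(To make the difference concrete, here's a tiny illustration of the two idiv options; it just shows the piece sizes and isn't part of the solution:)

library(foreach)
library(iterators)
# 'chunks' splits the total into exactly that many pieces (one task per worker),
# while 'chunkSize' caps the size of each piece (typically many tasks per worker)
foreach(n=idiv(10, chunks=4), .combine=c) %do% n      # 4 pieces that sum to 10
foreach(n=idiv(10, chunkSize=3), .combine=c) %do% n   # pieces of at most 3 that sum to 10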

Steve Weston
  • Thanks, voted up. I will finalize this question soon. – 989 Jun 16 '16 at 08:44
  • If the parallel approach in `Example1` does not work because the tasks are not compute-intensive, then why does it work in parallel for small iteration counts but not for large ones? – 989 Jun 16 '16 at 09:11