24

I am using the R package foreach with %dopar% to run long (~days) calculations in parallel. I would like to be able to stop the entire set of calculations if one of them produces an error. However, I have not found a way to achieve this, and neither the documentation nor various forums give any indication that it is possible. In particular, break does not work, and stop() only stops the current calculation, not the whole foreach loop.

Note that I cannot use a simple for loop, because ultimately I want to parallelize this using the doRNG package.

Here is a simplified, reproducible version of what I am attempting (shown here in serial with %do%, but I have the same problem when using doRNG and %dopar%). Note that in reality I want to run all of the elements of this loop (here 10) in parallel.

library(foreach)
myfunc <- function() {
  x <- foreach(k = 1:10, .combine="cbind", .errorhandling="stop") %do% {
    cat("Element ", k, "\n")
    Sys.sleep(0.5) # just to show that stop does not cause exit from foreach
    if(is.element(k, 2:6)) {
      cat("Should stop\n")
      stop("Has stopped")
    }
    k
  }
  return(x)
}
x <- myfunc()
# stop() halts the processing of k=2:6, but it does not stop the foreach loop itself.
# x is not returned. The execution produces the error message
# Error in { : task 2 failed - "Has stopped"

What I would like is for the entire foreach loop to be exited immediately when some condition arises (here, when stop() is encountered).

I have found no way to achieve this with foreach. It seems that I would need a way to send a message to all the other processes to make them stop too.

If not possible with foreach, does anyone know of alternatives? I have also tried to achieve this with parallel::mclapply, but that does not work either.
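
For reference, here is a minimal sketch of the kind of mclapply attempt I mean (illustrative only, not the exact code I tried; it assumes a Unix-like machine so forking is available):

library(parallel)
# Errors are caught by the workers and show up as "try-error" objects in the
# result (with a warning); the other jobs keep running to the end, so there
# is no early exit from the whole set of calculations.
res <- mclapply(1:10, function(k) {
  Sys.sleep(0.5)
  if (is.element(k, 2:6)) stop("Has stopped")
  k
}, mc.cores = 4)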

> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] C/UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods base

other attached packages:
[1] foreach_1.4.0

loaded via a namespace (and not attached):
[1] codetools_0.2-8 compiler_3.0.0  iterators_1.0.6
Coryn Bailer-Jones

  • Would it not be possible to use `for` instead? – Maxim.K Apr 18 '13 at 10:04
  • No, because ultimately I want to parallelize this using the doRNG package. (Sorry I didn't make that clear in my original post: I've edited it to make this explicit.) – Coryn Bailer-Jones Apr 18 '13 at 10:27
  • Based on your other comments, you might want to have each sub-process able to set a 'flag' object on failure, and make that object available for reading by all sub-processes. They'd all have to have some internal breakpoint or equivalent which regularly checks the value of the 'flag,' so they all can self-terminate. – Carl Witthoft Apr 18 '13 at 12:21
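
A minimal sketch of the flag idea from the last comment, using a file on disk as the shared flag (this assumes all workers share a filesystem; the file name, checkpoint structure, and simulated failure are made up for illustration):

library(foreach)
flagfile <- "abort.flag"                    # hypothetical shared flag file
x <- foreach(k = 1:10, .errorhandling = "pass") %dopar% {
  aborted <- FALSE
  for (chunk in 1:20) {                     # stand-in for a long computation with checkpoints
    if (file.exists(flagfile)) {            # another task raised the flag: self-terminate
      aborted <- TRUE
      break
    }
    Sys.sleep(0.1)                          # one chunk of "work"
    if (k == 3 && chunk == 5) {             # simulated failure in task 3
      file.create(flagfile)                 # raise the flag for all other tasks
      stop("task ", k, " failed")
    }
  }
  if (aborted) NULL else k
}
if (file.exists(flagfile)) file.remove(flagfile)  # clean up the flag

Tasks that have already started still run until their next checkpoint, so this gives a quick, but not instant, stop.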

6 Answers

13

It sounds like you want an impatient version of the "stop" error handling. You could implement that by writing a custom combine function, and arranging for foreach to call it as soon as each result is returned. To do that you need to:

  • Use a backend that supports calling combine on-the-fly, like doMPI or doRedis
  • Don't enable .multicombine
  • Set .inorder to FALSE
  • Set .init to something (like NULL)

Here's an example that does that:

library(foreach)
parfun <- function(errval, n) {
  abortable <- function(errfun) {
    comb <- function(x, y) {
      if (inherits(y, 'error')) {
        warning('This will leave your parallel backend in an inconsistent state')
        errfun(y)
      }
      c(x, y)
    }
    foreach(i=seq_len(n), .errorhandling='pass', .export='errval',
            .combine='comb', .inorder=FALSE, .init=NULL) %dopar% {
      if (i == errval)
        stop('testing abort')
      Sys.sleep(10)
      i
    }
  }
  callCC(abortable)
}

Note that I also set the error handling to "pass" so foreach will call the combine function with an error object. The callCC function is used to return from the foreach loop regardless of the error handling used within foreach and the backend. In this case, callCC will call the abortable function, passing it a function object that is used to force callCC to return immediately. By calling that function from the combine function, we can escape from the foreach loop as soon as we detect an error object, and have callCC return that object. See ?callCC for more information.
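
As a standalone illustration of the non-local exit that callCC (base R) provides, here is a small sketch unrelated to foreach; the function and variable names are arbitrary:

# callCC passes an "escape" function to its argument; calling that function
# makes callCC return immediately with the supplied value.
f <- function(escape) {
  for (i in 1:10) {
    if (i == 3)
      escape(paste("bailed out at", i))  # non-local exit from callCC
  }
  "finished normally"
}
callCC(f)
# [1] "bailed out at 3"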

You can actually use parfun without a parallel backend registered and verify that the foreach loop "breaks" as soon as it executes a task that throws an error, but that could take a while since the tasks are executed sequentially. For example, this takes 20 seconds to execute if no backend is registered:

print(system.time(parfun(3, 4)))

When executing parfun in parallel, we need to do more than simply break out of the foreach loop: we also need to stop the workers, otherwise they will continue to compute their assigned tasks. With doMPI, the workers can be stopped using mpi.abort:

library(doMPI)
cl <- startMPIcluster()
registerDoMPI(cl)
r <- parfun(getDoParWorkers(), getDoParWorkers())
if (inherits(r, 'error')) {
  cat(sprintf('Caught error: %s\n', conditionMessage(r)))
  mpi.abort(cl$comm)
}

Note that the cluster object can't be used after the loop aborts, because things weren't properly cleaned up, which is why the normal "stop" error handling doesn't work this way.

Steve Weston
  • +1 for the comment, and for your book that helped me a lot :) – statquant Apr 19 '13 at 19:28
  • I am a bit confused about what is causing the early exit if it is no longer a stop() in the combine function comb(). I presume the stop() in the foreach is triggering the call of comb(). Is it then errfun() which causes the early exit? But what is errfun()? It is not explicitly defined (and the name is arbitrary). Also, when I run parfun(6,12) with %dopar% and doMPI on 4 cores, the execution continues for i=5,7,8,9 (verified using the sink() approach in my answer below), so I'm not sure it really is stopping early when run in parallel. – Coryn Bailer-Jones Apr 23 '13 at 14:35
  • The `stop` in the `foreach` loop simply causes an error object to be returned as the task result to the master. Since error handling is "pass", `foreach` passes it to the combine function, and does so immediately due to the specified options. If the combine function calls `errfun`, the combine function won't return to its caller, but to `callCC`. But as I say in the revised answer, that has no effect on the workers, which is why mpi.abort is needed. – Steve Weston Apr 23 '13 at 15:23
  • I believe the additional output that you see from the workers is because they continue as normal until `mpi.abort` is called. – Steve Weston Apr 23 '13 at 15:26
  • Actually, the additional output is produced after mpi.abort has been called (and several seconds after Caught error' is printed) and I see in "top" that the processes are still running. So although we are breaking out of the foreach loop (and thus preventing later workers from starting) the existing workers are not being stopped. So it seems that this approach does not allow me to instantly stop all workers on an error in one of them. I experimented with passing `cl` into the `foreach` loop and using `mpi.abort(cl$comm)` there, but (not surprisingly) this didn't work. – Coryn Bailer-Jones Apr 24 '13 at 05:59
  • It stops almost instantly on my Linux machine using Rmpi 0.6.1 built with Open MPI 1.4.3. When spawning 6 workers, the whole run takes about 2 seconds and I'm able to verify with "top" that everything was killed. I don't know why mpi.abort isn't working for you. – Steve Weston Apr 25 '13 at 16:09
4

It's not a direct answer to your question, but using when() you can skip the iterations for which a condition is not satisfied:

x <- foreach(k = 1:10, .combine="cbind", .errorhandling="stop") %:%
  when( !is.element(k, 2:6) ) %do%
  {
    cat("Element ", k, "\n")
    Sys.sleep(0.5)
    k
  }

EDIT:

I forgot something: I think it is by design that you cannot simply stop the foreach loop. If you run the loop in parallel, each iteration is processed independently, which means that when you stop the entire loop at k=2 it is not predictable whether the process for k=1 has already terminated or is still running. Hence, using the when() condition gives you a deterministic result.

EDIT 2: Another solution considering your comment.

shouldStop <- FALSE
x <- foreach(k = 1:10, .combine="cbind", .errorhandling="stop") %do%
  {
    if( !shouldStop ){
      # put your time consuming code here
      cat("Element ", k, "\n")
      Sys.sleep(0.5)
      shouldStop <- shouldStop ||  is.element(k, 2:6)
      k
    }
  }

With this solution, the iterations that are already running when the stop condition becomes true still run to completion, but you avoid spending time on all subsequent iterations.

Beasterfield
  • The problem is that I will only know whether I want to exit from the loop once I've done some of the calculations in the loop. Yet it's exactly these calculations which I want to parallelize with this loop. (In other words, the condition itself is time-consuming to calculate.) – Coryn Bailer-Jones Apr 18 '13 at 11:17
  • EDIT 2 is a useful suggestion, but the way I run this is that the number of things to process is equal to the number of CPU cores available (10-50). Thus all processes are started simultaneously, and there are no future processes to avoid starting. As it is now, I must wait for all of these to finish before I get the error message from stop(). A workaround would be for me to manually kill the entire program as soon as I see the message produced by cat() (in my post), but this is impractical because this is a long run (~1 day) that runs in the background on a remote machine. – Coryn Bailer-Jones Apr 18 '13 at 11:53
  • This information changes the whole thing and should be mentioned in the original post. However, I must admit that in this case my ideas are limited. You could try to control your nodes directly, for example using `clusterApply` from the `snow` package, and call `stopCluster()` when the first job with the desired result is done. But be aware that calling `stopCluster()` from a slave process will give ugly errors, and additionally the result will not be returned to the master. Maybe someone else has an idea how the result can be passed? – Beasterfield Apr 18 '13 at 13:12
3

Instead of trying to break out of the loop, I write a small file to disk when I've reached my terminal iteration, and then have all remaining iterations simply skip themselves based on the existence of that file.

Checking whether a file exists costs less than a millisecond of computing time.

# 1.4 seconds to check if a file exists a million times
system.time(lapply(1:1e6, function(x) file.exists("checker.txt")))
   user  system elapsed 
  1.204   0.233   1.437 

This is great when you don't have a fixed number of iterations, or when your process can finish before all of the iterations are complete (on convergence, for example).

library(foreach)

alist <- foreach(i = 1:5000) %dopar% { 
  if(file.exists("checker.txt")) {
    return(NULL)
  } else {
    if(i == 20) {
      write("", "checker.txt") # write an empty file
    }
    return(i)
  }
}

file.remove("checker.txt")

The great thing about this is that even though the result list is extremely long, the skipped iterations return NULL, so if you just unlist() it you get only the values that were actually computed.

> length(alist)
[1] 5000

> unlist(alist)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

Don't bother trying to break, instead, just "skip the rest"!

Brandon Bertelsen
  • Brilliant - was stuck for a long time on this one. I used this method within a try/catch statement. When a subprocess generates an error, it skips the remaining iterations. And the script then restarts the foreach loop with some modifications (not producing an error this time). – marsei Sep 05 '21 at 01:06
  • this is great! Thanks – Carlos Llosa Oct 07 '21 at 15:03
1

The answer I got from REvolution technical support: "no -- foreach doesn't currently have a way to stop all parallel computations on an error in any one of them".

Coryn Bailer-Jones
0

I have not had much luck getting foreach to do what I want, so here is a solution using the parallel package which seems to do the job. I use the intermediate option of mccollect() to pass results from my function, do.task(), immediately to the function check.res(). If do.task() throws an error, check.res() uses this to call tools::pskill and explicitly kill all workers. This might not be very elegant, but it works in the sense that it causes an instant stop of all workers. Furthermore, I can simply inherit all the variables I need for the processing in do.task() from the present environment. (In reality do.task() is a much more complex function requiring many variables to be passed in.)

library(parallel)

# do.task() and check.res() inherit some variables from enclosing environment

do.task <- function(x) {
  cat("Starting task", x, "\n")
  Sys.sleep(5*x)
  if(x==stopat) { 
    stop("Error in job", x) # thrown to mccollect() which sends it to check.res()
  }
  cat("  Completed task", x, "\n")
  return(10*x)
}

check.res <- function(r) { # r is list of results so far
  cat("Called check.res\n")
  sendKill <- FALSE
  for(j in 1:Njob) { # check whether need to kill
    if(inherits(r[[j]], 'try-error')) {
      sendKill <- TRUE
    }
  }
  if(sendKill) { # then kill all
    for(j in 1:Njob) {
      cat("Killing job", job[[j]]$pid, "\n") 
      tools::pskill(job[[j]]$pid) # mckill not accessible
    }
  }
}

Tstart <- Sys.time()
stopat <- 3
Njob <- 4
job <- vector("list", length=Njob)
for(j in 1:Njob) {
  job[[j]]<- mcparallel(do.task(j))
}
res <- mccollect(job, intermediate=check.res) # res is in order 1:Njob, regardless of how long jobs took
cat("Collected\n")
Tstop <- Sys.time()
print(difftime(Tstop,Tstart))
for(j in 1:Njob) {
  if(inherits(res[[j]], 'try-error')) {
    stop("Parallel part encountered an error")
  }
}

This gives the following screen dump and results for the variable res:

> source("exp5.R")
Starting task 1 
Starting task 2 
Starting task 3 
Starting task 4 
  Completed task 1 
Called check.res
Called check.res
  Completed task 2 
Called check.res
Called check.res
Called check.res
Killing job 21423 
Killing job 21424 
Killing job 21425 
Killing job 21426 
Called check.res
Killing job 21423 
Killing job 21424 
Killing job 21425 
Killing job 21426 
Called check.res
Killing job 21423 
Killing job 21424 
Killing job 21425 
Killing job 21426 
Collected
Time difference of 15.03558 secs
Error in eval(expr, envir, enclos) : Parallel part encountered an error
> res
$`21423`
[1] 10

$`21424`
[1] 20

$`21425`
[1] "Error in do.task(j) : Error in job3\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in do.task(j): Error in job3>

$`21426`
NULL
Coryn Bailer-Jones
-1

Steve Weston's original answer essentially answered this, but here is a slightly modified version of it which also preserves two additional features that I need: (1) random number generation; (2) printing of run-time diagnostics.

suppressMessages(library(doMPI))

comb <- function(x, y) {
  if(inherits(y, 'error')) {
    stop(y)
  }
  rbind(x, y) # forces the row names to be 'y'
}

myfunc <- function() {
  writeLines(text="foreach log", con="log.txt")
  foreach(i=1:12, .errorhandling='pass', .combine='comb', .inorder=FALSE, .init=NULL) %dopar% {
    set.seed(100)
    sink("log.txt", append=TRUE)
    if(i==6) {
      stop('testing abort')
    }
    Sys.sleep(10)
    cat("Completed task", i, "\n")
    sink(NULL)
    rnorm(5,mean=i)
  }
}

myerr <- function(e) {
  cat(sprintf('Caught error: %s\n', conditionMessage(e)))
  mpi.abort(cl$comm)
}

cl <- startMPIcluster(4)
registerDoMPI(cl)
r <- tryCatch(myfunc(), error=myerr)
closeCluster(cl)

When this file is sourced, it exits as intended with an error message:

> source("exp2.R")
    4 slaves are spawned successfully. 0 failed.
Caught error: testing abort
[ganges.local:16325] MPI_ABORT invoked on rank 0 in communicator  with errorcode 0

The 'log.txt' file provides correct diagnostics up to the point of the error, and then additional error information. Crucially, the execution of all tasks is halted as soon as the stop() in the foreach loop is encountered: it does not wait until the entire foreach loop has completed. Thus I only see the 'Completed task' message up to i=4. (Note that if Sys.sleep() is shorter, later tasks may be started before the mpi.abort() is processed.)

If I change the stop condition to "i==100", then the stop, and hence the error, is not triggered. The code exits successfully without an error message, and r is a 2D array with dimensions 12 x 5.

Incidentally, it seems that I don't actually need .inorder=FALSE (I think that just gives me a small speed increase in the event that an error is found).

Coryn Bailer-Jones
  • I changed my answer because I discovered it was taking advantage of incorrect error handling in doMPI. Executing `stop` in the combine function shouldn't abort foreach, and that is fixed in the development version of doMPI on R-forge, so your answer won't work when that is released. – Steve Weston Apr 23 '13 at 12:52
  • If you don't set `.inorder=FALSE`, then the combine function won't be called until all previous tasks have been processed by the combine function. So if the failing task isn't the very first task, it will take at least 10 seconds to abort in your example. – Steve Weston Apr 23 '13 at 12:59