elastic parallelism and fault-tolerance in distributed Julia

Question

How does Julia expose fault tolerance, both when a node goes down (intentionally or not) and when communication between nodes is lost?

I saw a few mentions of such a feature but could not find out exactly how it can be done.

score 3 · Accepted Answer · answered Mar 07 '17 at 20:24

In the pmap docstring you can see that this is already implemented there, via the on_error and retry_* keyword arguments.

pmap([::AbstractWorkerPool], f, c...; distributed=true, batch_size=1,
on_error=nothing, retry_n=0, retry_max_delay=DEFAULT_RETRY_MAX_DELAY,
retry_on=DEFAULT_RETRY_ON) -> collection

... Any error stops pmap from processing the remainder of the collection. To override this behavior you can specify an error handling function via argument on_error which takes in a single argument, i.e., the exception. The function can stop the processing by rethrowing the error, or, to continue, return any value which is then returned inline with the results to the caller.
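As a rough illustration of the on_error behaviour described above, here is a minimal sketch. It assumes Julia 1.x, where pmap lives in the Distributed standard library (in 0.5 it was in Base); the function `flaky` is a made-up example, not part of any API. With no workers added, pmap simply runs on the calling process.

```julia
using Distributed

# Hypothetical task for illustration: fails on even inputs.
flaky(x) = iseven(x) ? error("cannot process $x") : 2x

# on_error receives the exception. Returning a value (here `nothing`)
# places that value in the results instead of aborting the whole pmap.
results = pmap(flaky, 1:4; on_error = e -> nothing)
```

To stop processing instead, the handler could rethrow, for example `on_error = e -> throw(e)`.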

Failed computation can also be retried via retry_on, retry_n, retry_max_delay, which are passed through to retry as arguments retry_on, n and max_delay respectively. If batching is specified, and an entire batch fails, all items in the batch are retried.
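A small sketch of retrying, with one caveat: the docstring quoted here is from Julia 0.5. In Julia 1.x the retry keywords of pmap are `retry_delays` and `retry_check` rather than `retry_n`, `retry_max_delay` and `retry_on`, and the sketch below assumes that newer form. The function `flaky_once` and the `attempts` counter are hypothetical, and the counter only works as written because no workers are added, so everything runs on the calling process.

```julia
using Distributed

# Hypothetical bookkeeping: how many times each input has been attempted.
attempts = Dict{Int,Int}()

# Hypothetical task: the first attempt for each input throws, later ones succeed.
function flaky_once(x)
    n = get(attempts, x, 0) + 1
    attempts[x] = n
    n == 1 && error("transient failure on $x")
    return x^2
end

# Up to two retries per item, with short, bounded delays between attempts.
delays = ExponentialBackOff(n = 2, first_delay = 0.01, max_delay = 0.1)
results = pmap(flaky_once, 1:3; retry_delays = delays)
```

With real workers, the function and any state it relies on would have to be defined on every process (for example with `@everywhere`).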

I don't think there's anything like this for the @parallel macro, but you can use the Base.wrap_on_error and Base.wrap_retry functions to wrap your original function so that it deals with errors. You can see most of the implementation details by reading the definition of pmap at https://github.com/JuliaLang/julia/blob/v0.5.0/base/pmap.jl.

As far as I can tell, the basic strategy is just to catch the error (and potentially the data) and retry on the same worker, or on another one if that worker is down.
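For a parallel loop, the same catch-and-retry idea can be sketched by wrapping the function yourself. This is only an illustration under a few assumptions: it uses the public `retry` function instead of the internal Base.wrap_retry, it targets Julia 1.x, where @parallel is called @distributed, and `flaky` and `seen` are invented for the example. It runs on the calling process because no workers are added.

```julia
using Distributed

# Hypothetical state: inputs that have already failed once.
seen = Set{Int}()

# Hypothetical task: the first call for each input throws.
function flaky(i)
    if !(i in seen)
        push!(seen, i)
        error("transient failure on $i")
    end
    return i
end

# retry returns a new function that calls flaky again after each delay,
# and rethrows only if the final attempt still fails.
robust = retry(flaky; delays = fill(0.01, 2))

total = @distributed (+) for i in 1:4
    robust(i)
end
```

Note that this retries within whichever process runs the iteration; it does not by itself move work to another worker if that process dies.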
