I was running a function on a pool of N single-threaded workers (on N machines) with client.map, and one of the workers failed. Is there a way to automatically handle exceptions raised by a worker, redistribute its failed tasks to other workers, and ignore or exclude the bad worker from the pool?
I've tried simulating the issue with the methods shown below. To make one worker fail, I raise an OSError in my_function, which is submitted with client.map like so: futures = client.map(my_function, range(100)). In my example, the worker on 'Computer123' is the one that fails. To handle exceptions thrown by my_function, I call sys.exit in the exception_handler decorator, so when a task fails on a worker, the worker process exits.

The result is that the bad worker's distributed.nanny catches the failure and restarts the worker, while the scheduler reschedules its lost tasks on other workers. But once the bad worker is back up, it receives tasks again because it is still in the pool, fails again, and the cycle repeats. Since it keeps failing, the other workers eventually complete all the tasks. Ideally I would be able to handle exceptions from bad workers like 'Computer123' automatically and remove them from the pool. Maybe removing the bad worker from the pool is all I need to do?
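To be concrete, what I have in mind by "removing it from the pool" is something like the sketch below. I don't know whether retire_workers is the right call or where it would best be invoked from; the scheduler address and the worker address are placeholders, and looking the address up via scheduler_info() is only a guess on my part.

from dask.distributed import Client

client = Client('tcp://scheduler-address:8786')    # placeholder scheduler address

# Placeholder: the address of the worker running on 'Computer123'.
# I assume it could be looked up in client.scheduler_info()['workers'].
bad_worker_address = 'tcp://192.168.0.123:45678'

# Ask the scheduler to stop sending tasks to that worker?
client.retire_workers(workers=[bad_worker_address])

For reference, here is the simulation code described above: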
def exception_handler(orig_func):
    # If the wrapped function raises, exit the worker process so the
    # failure is visible to distributed.nanny.
    def wrapper(*args, **kwargs):
        try:
            return orig_func(*args, **kwargs)
        except Exception:
            import sys
            sys.exit(1)
    return wrapper


@exception_handler
def my_function(x):
    # Imports are inside the function so they run on the worker.
    import socket
    import time
    time.sleep(5)
    if socket.gethostname() == 'Computer123':
        # Simulate a permanently bad worker: this host always fails.
        raise OSError
    else:
        return x ** 2
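
For completeness, the driver side of my test looks roughly like this; the scheduler address is a placeholder, and client.gather is simply how I collect the results while the failures and restarts play out.

from dask.distributed import Client

client = Client('tcp://scheduler-address:8786')    # placeholder scheduler address

# One task per input; the scheduler spreads them across the N workers.
futures = client.map(my_function, range(100))

# Blocks until every task has completed somewhere, including tasks that were
# rerun after the worker on 'Computer123' died and was restarted.
results = client.gather(futures)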