
I was running a function on a pool of N single-threaded workers (on N machines) with client.map and one of the workers failed. I was wondering if there is a way to automatically handle exceptions raised by a worker, to redistribute its failed tasks to other workers, and to ignore or exclude it from the pool?

I've tried simulating the issue with the methods shown below. To cause one worker to fail I raise an OSError on it in my_function, which is submitted to client.map like so: futures = client.map(my_function, range(100)). In my example, the worker on 'Computer123' will be the one to fail. To handle exceptions thrown by my_function, I use sys.exit in exception_handler. So when a task fails on a worker, sys.exit is called. The result is that the bad worker's distributed.nanny catches the failure and restarts the worker while the client redistributes its failed tasks. But once the bad worker is back up again, it receives tasks again because it's still in the pool. It fails again and the process repeats. As it continues to fail, eventually the other workers complete all the tasks. It would be ideal if I could automatically handle exceptions from bad workers like 'Computer123' and remove it from the pool. Maybe removing it from the pool is all I need to do?

def exception_handler(orig_func):
  # The decorator must be defined before it is applied to my_function.
  def wrapper(*args, **kwargs):
    try:
      return orig_func(*args, **kwargs)
    except Exception:
      # Kill the worker process; distributed.nanny catches this and restarts it.
      import sys
      sys.exit(1)
  return wrapper

@exception_handler
def my_function(x):
  import socket
  import time
  time.sleep(5)
  if socket.gethostname() == 'Computer123':
    raise OSError  # simulate a bad worker on this host
  return x**2
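As an aside, one alternative worth sketching: instead of calling sys.exit (which kills the worker only for the nanny to restart it), the wrapper could catch the exception and return a sentinel carrying the failing hostname, so the client can see which host is bad. This is a hypothetical variation, not dask-specific code; the sentinel format is an assumption.

```python
import socket

def exception_handler(orig_func):
    """Hypothetical variant: report the failure back to the client
    instead of killing the worker with sys.exit."""
    def wrapper(*args, **kwargs):
        try:
            return orig_func(*args, **kwargs)
        except Exception as e:
            # Sentinel result identifying the host that failed.
            return {'failed_host': socket.gethostname(), 'error': repr(e)}
    return wrapper

@exception_handler
def my_function(x):
    # Stand-in for the hostname check in the question.
    if x < 0:
        raise OSError('simulated failure')
    return x ** 2
```

The client can then inspect each result: a plain value means success, while a dict with `'failed_host'` tells it which machine to stop trusting.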
billiam
  • I would have commented on [How to find why a task fails in dask distributed?](https://stackoverflow.com/questions/39647019/how-to-find-why-a-task-fails-in-dask-distributed), but I don't have the reputation to do so. – billiam Mar 27 '19 at 19:15

1 Answer


As a workaround, you could keep a dictionary of bad workers, adding the hostname to it each time you determine it is bad (perhaps after it raises a certain number of exceptions).

Then when you want to issue a task, check whether the current host is in the offending list. Something like:

  if socket.gethostname() in badHosts:
    return None  # skip: this host is known to be bad
  else:
    return do_something()

If you can share more details about how you manage the pool you connect to, I may be able to offer advice on removing bad workers directly instead of checking on every task.
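A minimal sketch of the bookkeeping part of this idea (the tracker class and its failure threshold are hypothetical; `Client.retire_workers` and `Client.scheduler_info` are real dask.distributed methods, but the commented usage below is an untested assumption about your setup):

```python
from collections import Counter

class BadHostTracker:
    """Count exceptions per hostname and flag a host as bad once it
    reaches max_failures. (Hypothetical helper; the threshold is arbitrary.)"""
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = Counter()
        self.bad_hosts = set()

    def record_failure(self, hostname):
        # Returns True once the host has crossed the failure threshold.
        self.failures[hostname] += 1
        if self.failures[hostname] >= self.max_failures:
            self.bad_hosts.add(hostname)
        return hostname in self.bad_hosts

    def is_bad(self, hostname):
        return hostname in self.bad_hosts

# With a dask.distributed Client you could then drop flagged hosts from
# the pool, e.g. (untested sketch):
# for addr, info in client.scheduler_info()['workers'].items():
#     if tracker.is_bad(info['host']):
#         client.retire_workers(workers=[addr])
```

Retiring the worker from the client side should keep the scheduler from assigning it further tasks, which is closer to "removing it from the pool" than having each task check the list itself.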

Salvatore
  • I don't know who the bad workers are before run time. I'm trying to automatically handle exceptions at run time including removing any eventual bad workers from the pool. – billiam Mar 27 '19 at 19:35
  • I think I understand now. You don't know a host is bad until after it fails the job, and once it is identified you don't want it to get any more jobs, right? How do you set up the pool? – Salvatore Mar 27 '19 at 19:56
  • That's correct. Sorry to edit the post so much. I'm trying to explain the issue as best I can. I have a cluster of machines. One machine hosts the scheduler. Each of the other machines hosts a single-threaded worker pointed to the scheduler. Beyond the simulation I did, it shouldn't matter that I have multiple machines hosting the workers. All that matters is that I have multiple workers. – billiam Mar 27 '19 at 20:26