
I am using Dask to run a pool of tasks, retrieving results in the order they complete with the as_completed function, and potentially submitting new tasks to the pool each time one returns:

from dask.distributed import Client, as_completed

client = Client()

# Initial set of jobs
futures = [client.submit(job.run_simulation) for job in jobs]
pool = as_completed(futures, with_results=True)

while True:
    # Wait for a job to finish
    f, result = next(pool)

    # Exit condition
    if result == 'STOP':
        break

    # Do processing and maybe submit more jobs
    more_jobs = process_result(f, result)
    more_futures = [client.submit(job.run_simulation) for job in more_jobs]
    pool.update(more_futures)

Here's my problem: The function job.run_simulation that I am submitting can sometimes hang for a long time, and I want to time out this function: kill the task and move on if the run time exceeds a certain limit.

Ideally, I'd like to do something like client.submit(job.run_simulation, timeout=10), and have next(pool) return None if the task ran longer than the timeout.

Is there any way that Dask can help me time out jobs like this?

What I've tried so far

My first instinct was to handle the timeout independently of Dask within the job.run_simulation function itself. I've seen two types of suggestions (e.g. here) for generic Python timeouts.

1) Use two threads, one for the function itself and one for a timer. My impression is this doesn't actually work because you can't kill threads. Even if the timer runs out, both threads have to finish before the task is completed.
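To illustrate why this falls short, here is a minimal sketch (long_simulation is a hypothetical stand-in for the hung function). Waiting on the future times out, but the thread itself keeps running:

import concurrent.futures
import time

def long_simulation():
    time.sleep(60)  # stands in for a simulation that hangs
    return 'done'

with concurrent.futures.ThreadPoolExecutor() as executor:
    future = executor.submit(long_simulation)
    try:
        result = future.result(timeout=10)  # stop waiting after 10 s
    except concurrent.futures.TimeoutError:
        result = None
    # The worker thread is still alive here; leaving the with-block
    # blocks until long_simulation actually returns.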

2) Use two separate processes (with the multiprocessing module), one for the function and one for the timer. This would work, but since I'm already in a daemon subprocess spawned by Dask, I'm not allowed to create new subprocesses.
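For completeness, that approach would look roughly like the sketch below. It works in an ordinary process, but calling it from a daemonic Dask worker fails with "daemonic processes are not allowed to have children" (returning real results would also need a Queue or Pipe, which I've left out):

import multiprocessing

def run_with_timeout(target, timeout):
    # Run target in a child process and hard-kill it on timeout.
    p = multiprocessing.Process(target=target)
    p.start()
    p.join(timeout)
    if p.is_alive():
        p.terminate()
        p.join()
        return None  # timed out
    return 'finished'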

A third possibility is to move the code block to a separate script that I run with subprocess.run and use the subprocess.run built in timeout. I could do this, but it feels like a worst-case fallback scenario because it would take a lot of cumbersome passing of data to and from the subprocess.
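That fallback would look roughly like this, where run_simulation.py and job_params are hypothetical, and the serialization is exactly the cumbersome part:

import json
import subprocess

try:
    completed = subprocess.run(
        ['python', 'run_simulation.py'],        # hypothetical worker script
        input=json.dumps(job_params).encode(),  # hypothetical job parameters
        capture_output=True,
        timeout=10,  # subprocess.run kills the child when this expires
    )
    result = json.loads(completed.stdout)
except subprocess.TimeoutExpired:
    result = None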

So it feels like I have to accomplish the timeout at the level of Dask. My one idea here is to create a timer as a subprocess at the same time as I submit the task to Dask. Then if the timer runs out, use Client.cancel() to stop the task. The problem with this plan is that Dask might wait for workers to free up before starting the task, and I don't want the timer running before the task is actually running.
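For reference, the timer idea would be a sketch like this; the clock starts at submission rather than when the task actually begins, which is the flaw just described:

import threading

future = client.submit(job.run_simulation)
# Cancel the future 10 s after submission, not 10 s after it starts.
timer = threading.Timer(10, client.cancel, args=[future])
timer.start()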

emitra17

1 Answer


Your assessment of the problem seems correct to me, and the solutions you went through are the same ones I would consider. Some notes:

  1. Client.cancel is unable to stop a function from running if it has already started. These functions are running in a thread pool and so you run into the "can't stop threads" limitation. Dask workers are just Python processes and have the same abilities and limitations.
  2. You say that you can't use processes from within a daemon process. One solution to this would be to change how you're using processes in one of the following ways:

    • If you're using dask.distributed on a single machine then just don't use processes

      client = Client(processes=False)
      
    • Don't use Dask's default nanny processes; your dask worker will then be a normal process, capable of using multiprocessing
    • Set dask's multiprocessing-context config to "spawn" rather than fork or forkserver
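As a sketch of the last two options (treat the exact config key and CLI flag as assumptions to check against the docs for your version of distributed):

import dask

# Spawn subprocesses instead of forking them; key name as it appears
# in recent distributed versions (assumption):
dask.config.set({'distributed.worker.multiprocessing-method': 'spawn'})

# Skipping the nanny is a command-line change rather than Python:
# start each worker with, e.g.,
#     dask-worker <scheduler-address> --no-nanny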

The clean way to solve this problem, though, is to handle it inside your function job.run_simulation. Ideally you would push this timeout logic down into that code and have it raise cleanly.
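For example, if the simulation is an iterative loop, a cooperative deadline check between steps raises cleanly without killing anything. This is only a sketch: initial_state, converged, and step are hypothetical placeholders for your simulation's own structure.

import time

def run_simulation(max_seconds=600):
    deadline = time.monotonic() + max_seconds
    state = initial_state()           # hypothetical setup
    while not converged(state):       # hypothetical stopping test
        if time.monotonic() > deadline:
            raise TimeoutError('simulation exceeded its time budget')
        state = step(state)           # hypothetical simulation step
    return state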

MRocklin