dask: How do I avoid timeout for a task?

Question

In my dask-based application (using the distributed scheduler), I'm seeing failures that start with this error text:

tornado.application - ERROR - Exception in Future <Future cancelled> after timeout
Traceback (most recent call last):
  File "/miniconda/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 970, in error_callback
    future.result()
concurrent.futures._base.CancelledError

They are followed by a second traceback which (I think) indicates which line my task was running when the timeout occurred. (Exactly how distributed manages to do this is not clear to me -- maybe via a signal?)

Here's the dask portion of the second traceback:

  ... my code...

  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/base.py", line 156, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/base.py", line 397, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 2308, in get
    direct=direct)
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 1647, in gather
    asynchronous=asynchronous)
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 665, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/utils.py", line 277, in sync
    six.reraise(*error[0])
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/utils.py", line 262, in f
    result[0] = yield future
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 1492, in _gather
    traceback)
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/bag/core.py", line 1562, in reify
    seq = list(seq)
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/bag/core.py", line 1722, in map_chunk
    yield f(*a)

  ... my code ....

Does after timeout indicate that the task has taken too long, or is there some other timeout that is triggering the cancellation, such as a nanny or heartbeat timeout? (From what I can tell, there is no explicit timeout on the length of a task in dask, but maybe I'm confused.)
I see that the task was cancelled. But I would like to know why. Is there any easy way to figure out which line of code (in dask or distributed) is cancelling my task, and why?
I expect these tasks to take a long time -- they are uploading large buffers to a cloud store. How can I increase the timeout of a particular task in dask?

score 1 · Accepted Answer · answered Jan 02 '19 at 01:35

1

Dask does not impose a timeout on tasks by default.

The cancelled future that you're seeing isn't a Dask future, it's a Tornado future (Tornado is the library that Dask uses for network communication). So unfortunately all this is saying is that something failed.

The subsequent traceback hopefully includes information about exactly the code was that failed. Ideally this points to a line in your functions where the failure occurred. Perhaps that helps?

In general we recommend these steps when debugging code run through Dask: http://docs.dask.org/en/latest/debugging.html

answered Jan 02 '19 at 01:35

MRocklin

55,641
23
163
235

1

Is it possible to specify a global task timeout? My problem is that one worker is stuck on some task, CPU load is high, but its heartbeat is ok and and it never finishes. Cluster is essentially stuck. I have one big green bar (that never moves) left in the "Tasks Processing" tab of the dashboard. – Anatoly Alekseev Mar 21 '22 at 00:26
@AnatolyAlekseev If it's just a one-off problem, can you not just restart the worker? – Rehan Rajput Feb 01 '23 at 15:47
@AnatolyAlekseev, If the problem is about a delayed task that takes a very long time to finish and is eating up the CPU, then you can do it in 4 steps: 1. Get the list of tasks that are currently being processed using `client.processing()`. 2. Extract the names of the workers which are running your long hanging task. 3. Cancel the futures belonging to the task - so it's not rescheduled. `client.cancel()` 4. restart the workers.`client.restart_workers`. – Rehan Rajput Feb 01 '23 at 15:54

dask: How do I avoid timeout for a task?

1 Answers1