I am using something along the lines of the example provided in the docs:

import dask.bag
import distributed
from dask_kubernetes import KubeCluster

cluster = KubeCluster.from_yaml('worker-spec.yml')
cluster.adapt(minimum=0, maximum=24, interval="20000ms")

# tasks and make_task are defined elsewhere in my code
dag = dask.bag.from_sequence(tasks).map(lambda x: make_task(x).execute())

with distributed.Client(cluster) as client:
    results = dag.compute(scheduler=client)

cluster.close()
In my case, the execute() function does a lot of IO and takes approximately 5-10 minutes to run. I want to configure the KubeCluster and the dask scheduler in a way that maximizes the chances of these long-running tasks completing successfully.
My question has two parts. First, how do I override a distributed configuration setting? I wanted to try something like

dask.config.set({'scheduler.work-stealing': False})

but I don't know the right place to set it. Specifically, I don't know whether this is a setting every worker needs to be aware of, or whether I can get away with specifying it only at the point where I instantiate the KubeCluster.
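
For reference, here are the two placements I have been considering; both the config key prefix and the environment-variable spelling are my reading of the dask configuration docs, so treat them as assumptions rather than a known-good recipe:

import dask

# Option 1: set the value in the client process before the cluster is created,
# hoping the in-process scheduler inherits it. I believe the full key needs
# the 'distributed.' prefix:
dask.config.set({'distributed.scheduler.work-stealing': False})

# Option 2: bake it into worker-spec.yml as an environment variable so that
# every worker pod starts with it (dask maps DASK_A__B__C to config key a.b.c):
#
#   env:
#     - name: DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING
#       value: "False"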
The second part of my question has to do with recommendations for tasks that are long-running (more than a few minutes). I have been experimenting with the default settings. Sometimes everything goes well, and sometimes the compute() call fails with the following exception:
<... omitting caller from the traceback ...>
  File "/usr/local/lib/python3.7/site-packages/dask/base.py", line 436, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/distributed/client.py", line 2587, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/usr/local/lib/python3.7/site-packages/distributed/client.py", line 1885, in gather
    asynchronous=asynchronous,
  File "/usr/local/lib/python3.7/site-packages/distributed/client.py", line 767, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/usr/local/lib/python3.7/site-packages/distributed/utils.py", line 345, in sync
    raise exc.with_traceback(tb)
  File "/usr/local/lib/python3.7/site-packages/distributed/utils.py", line 329, in f
    result[0] = yield future
  File "/usr/local/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/usr/local/lib/python3.7/site-packages/distributed/client.py", line 1741, in _gather
    raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: ("('lambda-364defe33868bf6e4864da2933065a12', 3)", <Worker 'tcp://172.18.7.71:39029', name: 9, memory: 0, processing: 4>)
I am running a recent commit from the master branch: dask-kubernetes@git+git://github.com/dask/dask-kubernetes.git@add93d56ba1ac2f7d00576bd3f2d1be0db3e1757.
Edit: I updated my code snippet to show that I am calling the adapt() function with the minimum number of workers set to 0. I have started wondering whether scaling down to 0 workers could cause the scheduler to shut down before it returns the compute() result.
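
If that is indeed what is happening, the workaround I am considering is to keep at least one worker alive until the results have been gathered; a minimal sketch of the change (untested):

# Keep the floor at 1 so the adaptive scaler never drops to zero workers
# while compute() is still running.
cluster.adapt(minimum=1, maximum=24, interval="20000ms")

with distributed.Client(cluster) as client:
    results = dag.compute(scheduler=client)

# Tear the cluster down only after the results are back in the client.
cluster.close()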