I am using something along the lines of the example provided in the docs:

import dask.bag
import distributed
from dask_kubernetes import KubeCluster

cluster = KubeCluster.from_yaml('worker-spec.yml')
cluster.adapt(minimum=0, maximum=24, interval="20000ms")

# tasks and make_task are defined elsewhere in my code
dag = dask.bag.from_sequence(tasks).map(lambda x: make_task(x).execute())

with distributed.Client(cluster) as client:
    results = dag.compute(scheduler=client)

cluster.close()
In my case, the execute() function does a lot of IO and takes approximately 5-10 minutes to run. I want to configure the KubeCluster and the dask scheduler in a way that maximizes the chances of these long-running tasks completing successfully.
My question has two parts. First, how do I override a distributed configuration setting? I wanted to try something like

dask.config.set({'scheduler.work-stealing': False})

but I don't know the right place to set it. Specifically, I don't know whether this is a setting every worker needs to be aware of, or whether I can get away with specifying it only at the point where I instantiate the KubeCluster.
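
For reference, here are the two placements I have been considering; both the config key prefix and the environment-variable spelling are my reading of the dask configuration docs, so treat them as assumptions rather than a known-good recipe:

import dask

# Option 1: set the value in the client process before the cluster is created,
# hoping the in-process scheduler inherits it. I believe the full key needs
# the 'distributed.' prefix:
dask.config.set({'distributed.scheduler.work-stealing': False})

# Option 2: bake it into worker-spec.yml as an environment variable so that
# every worker pod starts with it (dask maps DASK_A__B__C to config key a.b.c):
#
#   env:
#     - name: DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING
#       value: "False"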
The second part of my question has to do with recommendations for tasks that are long-running (more than a few minutes). I have been experimenting with the default settings. Sometimes everything goes well, and sometimes the compute() call fails with the following exception:
<... omitting caller from the traceback ...>
  File "/usr/local/lib/python3.7/site-packages/dask/base.py", line 436, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/distributed/client.py", line 2587, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/usr/local/lib/python3.7/site-packages/distributed/client.py", line 1885, in gather
    asynchronous=asynchronous,
  File "/usr/local/lib/python3.7/site-packages/distributed/client.py", line 767, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/usr/local/lib/python3.7/site-packages/distributed/utils.py", line 345, in sync
    raise exc.with_traceback(tb)
  File "/usr/local/lib/python3.7/site-packages/distributed/utils.py", line 329, in f
    result[0] = yield future
  File "/usr/local/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/usr/local/lib/python3.7/site-packages/distributed/client.py", line 1741, in _gather
    raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: ("('lambda-364defe33868bf6e4864da2933065a12', 3)", <Worker 'tcp://172.18.7.71:39029', name: 9, memory: 0, processing: 4>)
I am running a recent commit from the master branch: dask-kubernetes@git+git://github.com/dask/dask-kubernetes.git@add93d56ba1ac2f7d00576bd3f2d1be0db3e1757.
Edit: I updated my code snippet to show that I am calling the adapt() function with the minimum number of workers set to 0. I have started wondering whether scaling down to 0 workers could cause the scheduler to shut down before it returns the compute() result.
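
If that is indeed what is happening, the workaround I am considering is to keep at least one worker alive until the results have been gathered; a minimal sketch of the change (untested):

# Keep the floor at 1 so the adaptive scaler never drops to zero workers
# while compute() is still running.
cluster.adapt(minimum=1, maximum=24, interval="20000ms")

with distributed.Client(cluster) as client:
    results = dag.compute(scheduler=client)

# Tear the cluster down only after the results are back in the client.
cluster.close()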