I am using Dask on a Slurm cluster:
from dask.distributed import Client
from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(cores=64, processes=64, memory="128G", walltime="24:00:00")
# export DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES=100
cluster.adapt(minimum_jobs=1, maximum_jobs=2, interval="20 s", target_duration="100 s", wait_count=20)
client = Client(cluster)
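(For completeness, this is how I understand the commented-out environment variable maps onto the Dask config system; just a sketch of the equivalent in-Python setting, in case exporting the variable is the wrong way to apply it:)

import dask
# assumed equivalent of the exported environment variable above:
# DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES -> distributed.scheduler.allowed-failures
dask.config.set({"distributed.scheduler.allowed-failures": 100})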
My workload is a pairwise (two-by-two) reduction from ~1000 inputs down to 1, and each reduction takes ~2 minutes. So there is a lot of parallelism at the beginning and less towards the end. I can only access two nodes of the cluster, so I expect it to use two cluster nodes at the beginning and one cluster node at the end.
# pseudo code
def reduce(task):
    # first level: pair up the raw inputs
    futures = []
    for i in range(0, len(task), 2):
        futures.append(client.submit(reduceTwo, task[i], task[i + 1]))
    # keep pairing up results until a single future remains
    while len(futures) != 1:
        futures_new = []
        for i in range(0, len(futures), 2):
            futures_new.append(client.submit(reduceTwo, futures[i].result(), futures[i + 1].result()))
        futures = futures_new
    return futures[0].result()
However, my problem is that when cluster.adapt() is expected to scale down from 2 cluster nodes to 1, it scales down to 0 first and then starts a new node.
Question 1: Is it normal for it to go down to 0? This would not actually be a problem if the output data held in the memory of the killed workers were saved properly (for example on the scheduler node, the login node of the cluster). However, reading the logs, it seems that the workers were killed too early, before they could be stopped/retired normally. Some workers retire and some do not.
Question 2: Can this "kill before retire" actually happen, and is there any way to give workers more time to retire? As you can see in the first code block above, I tried to increase as many timing parameters as possible, but it does not help. I do not fully understand this parameter list.
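(To make Question 2 concrete, this is the kind of graceful retirement I expect: results are first moved off the worker, and only then is the worker closed. A manual sketch of what I assume adaptive scaling should be doing internally, using Client.retire_workers; the way I pick the worker to retire is only illustrative:)

# manual sketch: retire one worker gracefully so its results are
# replicated to the surviving worker before its SLURM job goes away
workers = list(client.scheduler_info()["workers"])  # current worker addresses
if len(workers) > 1:
    client.retire_workers(workers=workers[:1], close_workers=True)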
I know that I can optimize my code, for example by calling del on futures as soon as their computation is finished (sketched below), so that a worker holds no staged results in memory and its death does not cause too many computations to be redone. Or there may be an existing reduction library I could use. But anyway, can these two Dask problems be solved?
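For reference, this is roughly the "del futures" variant I have in mind (just a sketch, not tested, with the same even-length assumption as the pseudo code above; it passes the futures themselves to client.submit so intermediates stay on the workers, and dels the previous level so finished results can be freed):

# hypothetical variant of the pseudo code above; reduceTwo as before
def reduce_tree(task):
    futures = [client.submit(reduceTwo, task[i], task[i + 1])
               for i in range(0, len(task), 2)]
    while len(futures) > 1:
        # pass futures directly so intermediate results stay on the workers
        next_level = [client.submit(reduceTwo, futures[i], futures[i + 1])
                      for i in range(0, len(futures), 2)]
        del futures  # drop references so finished intermediates can be released
        futures = next_level
    return futures[0].result()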