
I am using Dask with a Slurm cluster:

cluster = SLURMCluster(cores=64, processes=64, memory="128G", walltime="24:00:00")
#export DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES=100
cluster.adapt(minimum_jobs=1, maximum_jobs=2, interval="20 s", target_duration="100 s", wait_count=20)

My workload is a two-by-two (pairwise) reduction from ~1000 inputs down to 1. Each pairwise reduction takes ~2 minutes, so there is a lot of parallelism at the beginning and very little at the end. I can only access two nodes of the cluster, so I expect it to use two cluster nodes at the beginning and one cluster node at the end.

# pseudo code
def reduce(task):
    # first level: pairwise-reduce the raw inputs (assumes an even count)
    futures = []
    for i in range(0, len(task), 2):
        futures.append(client.submit(reduceTwo, task[i], task[i + 1]))
    # keep reducing pairwise until a single future is left
    while len(futures) != 1:
        futures_new = []
        for i in range(0, len(futures), 2):
            # pass the futures themselves rather than calling .result(),
            # so the next level is scheduled without blocking the client
            futures_new.append(client.submit(reduceTwo, futures[i], futures[i + 1]))
        futures = futures_new
    return futures[0].result()
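For reference, here is a minimal self-contained driver for the code above (reduceTwo here is just a placeholder combiner, and the input list is a stand-in for my ~1000 real inputs):

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

def reduceTwo(a, b):
    # placeholder pairwise combiner; the real one takes ~2 minutes per call
    return a + b

cluster = SLURMCluster(cores=64, processes=64, memory="128G", walltime="24:00:00")
cluster.adapt(minimum_jobs=1, maximum_jobs=2)
client = Client(cluster)

data = list(range(1024))   # stand-in for the ~1000 inputs (power of two keeps every level even)
result = reduce(data)      # reduce() as defined above
print(result)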

However, my problem is that when cluster.adapt() is expected to scale down from 2 cluster nodes to 1, it first drops to 0 nodes and then starts a new one.

Question 1: Is it normal for it to drop to 0? This would not actually be a problem if the output data sitting in the memory of the killed workers were saved properly (for example on the scheduler node, the login node of the cluster). However, reading the logs, it seems the workers were killed too early, before they could be stopped/retired normally. Some workers retire cleanly and some do not.

Question 2: Can this "kill before retire" happen, and is there any way to give workers more time to retire? As you can see in the first code snippet above, I have tried to increase as many timing parameters as possible, but it does not help. I do not fully understand this parameter list.
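For what it is worth, the same knobs can also be set from Python rather than through the environment variable; a minimal sketch follows (whether these are the right knobs for the retirement problem is exactly what I am unsure about):

import dask

# same as the DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES export above:
# tolerate more task failures before the scheduler gives up on a task
dask.config.set({"distributed.scheduler.allowed-failures": 100})

cluster.adapt(
    minimum_jobs=1,
    maximum_jobs=2,
    interval="20 s",          # how often the adaptive controller re-evaluates
    target_duration="100 s",  # desired runtime the controller sizes the cluster for
    wait_count=20,            # consecutive scale-down suggestions needed before removing a worker
)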

I know that I can optimize my code, for example by deleting futures once their computation has finished (see the sketch below), so that a worker holds no task results in memory and its death does not force much recomputation. Or there may be an existing reduction library I could use. But, in any case, can these two Dask problems be solved?
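As an illustration of that first idea, a minimal sketch of the inner loop (it assumes the reduce() code above; dask.distributed.wait and explicitly dropping the old references are the only additions):

from dask.distributed import wait

while len(futures) != 1:
    futures_new = []
    for i in range(0, len(futures), 2):
        futures_new.append(client.submit(reduceTwo, futures[i], futures[i + 1]))
    wait(futures_new)   # make sure this level has finished
    del futures         # drop the old futures so their results can be forgotten
    futures = futures_new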

1 Answer


Answering my own question, in case anyone else runs into the same problem.

The key is: do not let all nodes save their temporary files into the same directory on a shared disk.

Without specifying local_directory, it is very easy for all nodes to end up saving the workers' local files into the ~/dask-worker-space directory, which is shared among all nodes. The nodes then compete to read and write in this directory, and when one node wants to kill its workers, it may accidentally kill workers on the other nodes. This is why (Q1) the node count drops to 0, and also why (Q2) the data of the killed workers fails to be moved.
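A minimal sketch of the fix (the exact path is only an example; any node-local scratch directory works):

from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=64,
    processes=64,
    memory="128G",
    walltime="24:00:00",
    # point workers at node-local storage instead of the shared
    # ~/dask-worker-space, so workers on different nodes never
    # touch each other's spill/temp files
    local_directory="/tmp/dask-worker-space",  # example path
)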

I am hoping Dask can support all nodes writing into the same dask-worker-space. That is really the natural usage: when I just want to use Dask quickly for some parallelism, my intuition does not tell me "set local_directory, otherwise the program will crash".