I am trying to parallelize methods from a class using Dask on a PBS cluster.
My main challenge is that a method must parallelize some computations, then run further parallel computations on their result. In addition, the whole thing needs to be distributed over the cluster so that the same two-step computation can run on several other datasets at the same time.
The cluster is created like this:
    from dask.distributed import Client
    from dask_jobqueue import PBSCluster

    cluster = PBSCluster(cores=4,
                         memory="10GB",       # must be a string, not a bare 10GB
                         interface="ib0",
                         queue=queue,
                         processes=1,
                         nanny=False,
                         walltime="02:00:00",
                         shebang="#!/bin/bash",
                         env_extra=env_extra,
                         python=python_bin)
    cluster.scale(8)
    client = Client(cluster)
The class I need to distribute has two steps that must run strictly one after the other, because step1 writes a file that is read back at the beginning of step2.
I first tried putting both steps one after the other inside a single method:
    def computations(params):
        my_class(**params).run_step1(run_path)
        my_class(**params).run_step2()

    chain = []
    for p in params_compute:
        y = dask.delayed(computations)(p)
        chain.append(y)

    dask.compute(*chain)
But it does not work: run_step1 only builds a lazy graph and returns immediately, so run_step2 tries to read a file that has not been written yet. I therefore need a way to block the execution until step1 has actually finished.
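One blocking variant I have been considering is to keep computations as a plain function mapped over the workers, and to force each step to run with the single-threaded scheduler so I am not calling the distributed scheduler from inside a running task. This is only a sketch, and it assumes each step's internal graph is small enough to run single-threaded on one worker:

    def computations(params):
        obj = my_class(**params)
        # run step1's whole graph on this worker; blocks until post_gpu.tif is written
        obj.run_step1(run_path).compute(scheduler="synchronous")
        obj.run_step2().compute(scheduler="synchronous")

    futures = client.map(computations, params_compute)
    client.gather(futures)

I am not sure whether this wastes the parallelism available inside each step, though.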
I have also tried to force the execution of the first step by adding a compute():
    def computations(params):
        my_class(**params).run_step1(run_path).compute()
        my_class(**params).run_step2()
But this may not be a good idea either: since computations is itself wrapped in dask.delayed, running dask.compute(*chain) ultimately means calling compute() inside a compute(), which might explain why the second step is never executed?
What would be the best approach? Should I include a persist() somewhere at the end of step1?
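What I have in mind with persist() would be a driver-side variant like the following (a sketch; it assumes wait from dask.distributed can block on persisted collections, and note that the rio.open call at the top of run_step2 would then run on the client):

    from dask.distributed import wait

    # kick off step1 for every parameter set, keeping the results on the cluster
    step1s = [my_class(**p).run_step1(run_path).persist() for p in params_compute]
    wait(step1s)  # block until every post_gpu.tif has been written

    # every file now exists, so building and computing step2 is safe
    step2s = [my_class(**p).run_step2() for p in params_compute]
    dask.compute(*step2s)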
For reference, here are run_step1 and run_step2:
    def run_step1(self, path_step):
        preprocess_result = dask.delayed(self.run_preprocess)(path_step)
        gpu_result = dask.delayed(self.run_gpu)(preprocess_result)
        post_gpu = dask.delayed(self.run_postgpu)(gpu_result)  # writes a result file post_gpu.tif
        return post_gpu

    def run_step2(self):
        # opens the file written at the end of step1
        data_file = rio.open(self.outputdir + "/post_gpu.tif").read()
        temp_result1 = self.process(data_file)
        final_merge = dask.delayed(self.merging)(temp_result1)
        write = dask.delayed(self.write_final)(final_merge)
        return write
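For completeness, the restructuring I keep coming back to would pass step1's terminal delayed into step2 as an unused argument, so that both steps end up in one graph and the file read cannot start before the file exists. This is only a sketch: read_output is a hypothetical helper wrapping the rio.open(...).read() call (it would accept and ignore the dependency argument), and self.process gets wrapped in delayed as well:

    def run_step2(self, step1_done=None):
        # step1_done is never used; passing step1's terminal delayed here only
        # adds a graph edge, so the read waits for post_gpu.tif to exist
        data_file = dask.delayed(self.read_output)(step1_done)
        temp_result1 = dask.delayed(self.process)(data_file)
        final_merge = dask.delayed(self.merging)(temp_result1)
        return dask.delayed(self.write_final)(final_merge)

    def computations(params):
        obj = my_class(**params)
        step1 = obj.run_step1(run_path)
        return obj.run_step2(step1)  # one Delayed covering both steps

    dask.compute(*[computations(p) for p in params_compute])

With this version the outer computations no longer needs dask.delayed, since it only assembles a graph. Would that be the idiomatic way to express the file dependency?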