
I have a computational workload that I originally ran with concurrent.futures.ProcessPoolExecutor, which I converted to use dask so that I could take advantage of dask's integrations with distributed computing systems and scale beyond one machine. The workload consists of two task types:

  • Task A: takes string/float inputs and produces a matrix (around 2000 x 2000). Task duration is usually 60 seconds or less.
  • Task B: takes the matrix from task A and uses it and some other small inputs to solve an ordinary differential equation. The solution is written to disk (so no return value). Task duration can be up to fifteen minutes.

There can be multiple B tasks for each A task.

Originally, my code looked like this:

a_results = client.map(calc_a, a_inputs)
all_b_inputs = [(a_result, b_input) for b_input in b_inputs for a_result in a_results]
b_results = client.map(calc_b, all_b_inputs)
dask.distributed.wait(b_results)

because that was the clean translation of the concurrent.futures code (I actually kept the code so it can be run with either dask or concurrent.futures, which lets me compare the two). client here is a distributed.Client instance.

I have been experiencing some stability issues with this code, especially for large numbers of tasks, and I think I might not be using dask in the best way. Recently, I changed my code to use Delayed instead like this:

a_results = [dask.delayed(calc_a)(a) for a in a_inputs]
b_results = [dask.delayed(calc_b)(a, b) for a in a_results for b in b_inputs]
client.compute(b_results)

I did this because I thought perhaps the scheduler could work through the tasks more efficiently if it examined the entire graph before starting anything rather than beginning to schedule the A tasks before knowing about the B tasks. This change seems to help some but I still see some stability issues.

I can create separate questions for the stability problems, but I first wanted to find out if I am using dask in the best way for this use case or if I should modify how I am submitting the tasks. To describe the problems briefly: the worst one, to me, is that over time my workers drop to 0% CPU and tasks stop completing. Other problems include getting KilledWorker exceptions and seeing log messages about an unresponsive event loop and timeouts. Usually the scheduler runs fine for at least a few hours, completing thousands of tasks before these issues show up (which makes debugging difficult since the feedback loop is so long).

Some questions I have been wondering about:

  1. I can have thousands of tasks to run. Can I submit them all to dask up front, or do I need to submit them in batches? My thought was that the dask scheduler would be better at scheduling tasks than my batching code.
  2. If I do need to batch things myself, can I query the scheduler to find out the maximum number of workers so I can write something that will submit batches of the right size? Or do I need to make the batch size an input to my batching code?
  3. In the end, my results all get written to disk and nothing gets returned. With the way I am running tasks, are resources getting held onto longer than necessary?
  4. My B tasks are long, but they could be split by scheduling tasks that solve for solutions at intermediate time steps and feeding those in as the inputs to subsequent solving tasks. I think I need to do this anyway, because I would like to use an HPC cluster with a timed queue, and I think I need to use the lifetime parameter to retire workers so they don't run over the time limit; that works best with short-lived tasks (to avoid losing work when a worker is shut down early). Is there an optimal way to split the B task? (A rough sketch of the splitting I have in mind follows this list.)
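
For concreteness, here is a rough sketch of the splitting I have in mind. solve_chunk, write_solution, initial_state, and time_grid are placeholders rather than my real code; calc_a is the same function as above:

import dask

def solve_chunk(matrix, state, t_start, t_end):
    # placeholder: integrate the ODE from t_start to t_end starting from `state`
    # and return the state at t_end so the next chunk can continue from it
    ...

matrix = dask.delayed(calc_a)(a_input)
state = initial_state
for t_start, t_end in zip(time_grid[:-1], time_grid[1:]):
    state = dask.delayed(solve_chunk)(matrix, state, t_start, t_end)
final = dask.delayed(write_solution)(state)  # write the full solution to disk at the end
client.compute(final)

Each chunk would then be its own shorter task that depends on the previous chunk's state, which seems like what I would want when retiring workers with lifetime.
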
ws_e_c421
1 Answer


There are a lot of questions here, but with regard to the code snippets you provided: both look correct, but in my experience the futures version will scale better. The reason is that, by default, when one of the delayed tasks fails, the computation of all the delayed tasks halts, while futures can proceed as long as they are not directly affected by the failure.
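
To illustrate (using your calc_b as a stand-in and a hypothetical list of (a, b) pairs called ab_pairs): a single compute()/gather() over a batch of delayed tasks surfaces the first error to your code, whereas with futures you can inspect each task's outcome on its own via as_completed:

import dask
from dask.distributed import Client, as_completed

client = Client()

# delayed: one failing task raises out of the whole batch
delayed_tasks = [dask.delayed(calc_b)(a, b) for a, b in ab_pairs]
try:
    dask.compute(*delayed_tasks)
except Exception as exc:
    print(f"batch raised: {exc}")

# futures: handle each failure individually as tasks finish
futures = [client.submit(calc_b, a, b) for a, b in ab_pairs]
for fut in as_completed(futures):
    if fut.status == "error":
        # log and carry on; the remaining futures keep running
        print(f"{fut.key} failed: {fut.exception()}")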

Another observation is that delayed values tend to hold on to resources after completion, while with futures you can at least call .release() on them once they have completed (or use fire_and_forget).
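
A minimal sketch of both options, reusing the calc_b and all_b_inputs names from your question (the Client() setup here is just for illustration):

from dask.distributed import Client, fire_and_forget, as_completed

client = Client()

# option 1: calc_b writes its output to disk, so nothing needs to be kept --
# let the scheduler run the tasks and drop the local references immediately
# (note that you then no longer hold futures you can wait on)
for a, b in all_b_inputs:
    fire_and_forget(client.submit(calc_b, a, b))

# option 2: keep the futures so you can still wait on them, but release each
# one as soon as it has finished
futures = [client.submit(calc_b, a, b) for a, b in all_b_inputs]
for fut in as_completed(futures):
    fut.release()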

Finally, with very large task lists, it might be worth making them a bit more resilient to restarts. One basic option is to create a simple text file after each task completes successfully, and then on restart check which tasks still need to be computed. Fancier options include prefect and joblib.Memory, but if you don't need all the bells and whistles, the text-file route is often fastest.
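
A bare-bones sketch of the text-file route (calc_b_checked, solve_and_write, all_b_tasks, and the marker directory are made-up names, not from your code): write a marker only after a task's output is safely on disk, and filter the task list against the markers before submitting anything on a re-run:

from pathlib import Path

MARKER_DIR = Path("completed_tasks")
MARKER_DIR.mkdir(exist_ok=True)

def calc_b_checked(a_result, b_input, task_id):
    marker = MARKER_DIR / f"{task_id}.done"
    if marker.exists():
        return  # already completed on a previous run
    solve_and_write(a_result, b_input)  # the real B task, writes its solution to disk
    marker.touch()  # only mark done once the output is on disk

# on restart, skip completed work before submitting anything to the scheduler
pending = [(a, b, tid) for a, b, tid in all_b_tasks
           if not (MARKER_DIR / f"{tid}.done").exists()]
futures = [client.submit(calc_b_checked, a, b, tid) for a, b, tid in pending]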

SultanOrazbayev
  • Yes, I don't like the size of this question but I worry about the X-Y problem if I broke it up because my subquestions might be going down the wrong path. – ws_e_c421 Jul 21 '21 at 15:01
  • My initial testing with `delayed` shows improvement compared to my futures results. With futures, the task A progress bar would go fully solid in the dashboard before many B tasks completed. With delayed, I see about 2x the number of workers' worth of A tasks complete, and they keep that margin ahead of the number of completed B tasks (for a 1:1 ratio of A to B tasks). Also, the left-hand side of the progress bars turns a lighter color, which I think means the resources of those tasks have been released. With futures, the bars stay solid. I can try experimenting with `release` and `fire_and_forget`. – ws_e_c421 Jul 21 '21 at 15:16
  • Just this week, I did exactly what you suggested of writing a text file and skipping those tasks on re-run. Previously, I had built in a check of the text file at the beginning of the B tasks (so it would be very short if it had already been run). Removing the tasks before submitting to dask has shown an improvement over having the tasks end immediately. I run into the stability issues less frequently on re-run with fewer tasks. – ws_e_c421 Jul 21 '21 at 15:25
  • If you haven't seen this already, this answer might be relevant: https://stackoverflow.com/a/61925097/10693596 – SultanOrazbayev Jul 21 '21 at 15:31
  • Thanks, I have tried `as_completed(..., with_results=True)` to try to get dask to release the resources of completed tasks but hadn't seen any difference from the example code. I could experiment with that more along with `release` and `fire_and_forget`. – ws_e_c421 Jul 21 '21 at 15:53
  • A couple of years ago I had a potentially similar problem, and after spending a few days trying to get it done in parallel I ended up splitting the task into small chunks (that the dask scheduler could process without problems) and iterating over them. It wasn't fast/efficient, but it solved what I had to solve back then. :/ – SultanOrazbayev Jul 21 '21 at 16:01
  • "whenever one of the delayed tasks fails, the computation of all delayed tasks halts" -- the tasks I am running shouldn't fail (should not raise exceptions). I have noticed that since using Delayed I get tasks with the status "Erred" in the dashboard which I didn't get with Future (maybe Future retries them?). I still see stability issues like KilledWorker exceptions. KilledWorker is raised on my `.compute()` call and stops that code. However, the other tasks queued by `.compute()` appear to proceed any way. I am not sure if this contradicts what you say about all delayed tasks halting. – ws_e_c421 Jul 23 '21 at 18:32
  • I accepted your answer and will try to find ways to split out smaller questions about my stability issues (KilledWorker exceptions, all workers dropping to 0% CPU, erred tasks, etc.). – ws_e_c421 Jul 23 '21 at 18:37
  • Thank you, sure, hopefully it will be easier to resolve smaller questions. – SultanOrazbayev Jul 23 '21 at 19:13