
I want to use Dask to process some 5000 batch tasks that store their results in a relational database. After they are all completed, I want to run a final task that will query the database and generate a result file (which will be stored in AWS S3).

So it's more or less like this:

from dask import bag, delayed
from dask.distributed import Client

batches = bag.from_sequence(my_batches())
results = batches.map(process_batch_and_store_results_in_database)
graph = delayed(read_database_and_store_bundled_result_into_s3)(results)

client = Client('the_scheduler:8786')
client.compute(graph)

And this works, but near the end of processing many workers sit idle, and I would like to be able to turn them off (and save some money on AWS EC2). If I do that, however, the scheduler will "forget" that those tasks were already completed and try to run them again on the remaining workers.

I understand that this is actually a feature, not a bug, since Dask keeps track of all the results before starting read_database_and_store_bundled_result_into_s3. But is there any way I can tell Dask to just orchestrate the distributed processing graph and not worry about state management?

Tony Lâmpada

1 Answer


I recommend that you simply forget the futures after they complete. This solution uses the dask.distributed concurrent.futures interface rather than dask.bag. In particular it uses the as_completed iterator.

from dask.distributed import Client, as_completed
client = Client('the_scheduler:8786')

futures = client.map(process_batch_and_store_results_in_database, my_batches())

seq = as_completed(futures)
del futures  # now the only references to the futures are held by seq

for future in seq:
    pass  # let future be garbage collected
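
Once the loop has drained seq, every batch future has completed and been released, and you can kick off the final step yourself. A minimal sketch, assuming read_database_and_store_bundled_result_into_s3 can be called with no arguments because everything it needs is already in the database:

# all batches are done and their futures have been garbage collected,
# so submit the final aggregation step as an independent task
final = client.submit(read_database_and_store_bundled_result_into_s3)
final.result()  # block until the bundled result file has been written to S3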
MRocklin