
I want to use Dask to process some 5000 batch tasks that store their results in a relational database. After they are all completed, I want to run a final task that will query the database and generate a result file (which will be stored in AWS S3).

So it's more or less like this:

from dask import bag, delayed
from dask.distributed import Client

batches = bag.from_sequence(my_batches())
results = batches.map(process_batch_and_store_results_in_database)
graph = delayed(read_database_and_store_bundled_result_into_s3)(results)

client = Client('the_scheduler:8786')
client.compute(graph)

And this works, but near the end of processing many workers sit idle, and I would like to be able to turn them off (and save some money on AWS EC2). If I do that, however, the scheduler will "forget" that those tasks were already completed and try to run them again on the remaining workers.

I understand that this is actually a feature, not a bug, since Dask keeps track of all the results before starting read_database_and_store_bundled_result_into_s3. But is there any way I can tell Dask to just orchestrate the distributed processing graph and not worry about state management?

Tony Lâmpada

1 Answer


I recommend that you simply forget the futures after they complete. This solution uses the dask.distributed concurrent.futures interface rather than dask.bag. In particular it uses the as_completed iterator.

from dask.distributed import Client, as_completed
client = Client('the_scheduler:8786')

futures = client.map(process_batch_and_store_results_in_database, my_batches())

seq = as_completed(futures)
del futures  # now the only references to the futures are held by seq

for future in seq:
    pass  # let future be garbage collected
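
Once the loop has drained seq, every batch future has completed and been released, and you can kick off the final step yourself. A minimal sketch, assuming read_database_and_store_bundled_result_into_s3 can be called with no arguments because everything it needs is already in the database:

# all batches are done and their futures have been garbage collected,
# so submit the final aggregation step as an independent task
final = client.submit(read_database_and_store_bundled_result_into_s3)
final.result()  # block until the bundled result file has been written to S3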
MRocklin