
Let's say I have a simple costly function that stores some results to a file:

import time

def costly_function(filename):
    time.sleep(10)
    with open(filename, 'w') as f:
        f.write("I am done!")

Now let's say I would like to schedule a number of these tasks in dask, which then takes these requests asynchronously and runs these functions one by one. I'm currently setting up a dask client object...

import dask.distributed

cluster = dask.distributed.LocalCluster(n_workers=1, processes=False)  # my attempt at sequential job processing
client = dask.distributed.Client(cluster)

... and then interactively (from IPython) scheduling these jobs:

>>> client.schedule(costly_function, "result1.txt")
>>> client.schedule(costly_function, "result2.txt")
>>> client.schedule(costly_function, "result3.txt")

The issue I'm getting is that these tasks are not running consecutively but in parallel, which in my particular case causes concurrency issues.

So my question is: What is the correct way to set up a job queue like the one I described above in dask?

Raven

1 Answer

Ok, I think I might have a solution (feel free to come up with better ones though!). It requires modifying the previous costly function slightly:

import time
import dask.distributed

def costly_function(filename, prev_job=None):
    # prev_job is never used in the body; it exists only so that a
    # previous future can be passed in as a dependency.
    time.sleep(10)
    with open(filename, 'w') as f:
        f.write("I am done!")

cluster = dask.distributed.LocalCluster(n_workers=1, processes=False)
client = dask.distributed.Client(cluster)

And then in interactive context you would write the following:

>>> future = client.submit(costly_function, "result1.txt")
>>> future = client.submit(costly_function, "result2.txt", prev_job=future)
>>> future = client.submit(costly_function, "result3.txt", prev_job=future)
MRocklin
  • I've modified your answer a bit. You don't need to call `.result`. This is done automatically. Also, the method name is submit, not schedule. – MRocklin Dec 07 '19 at 00:41
  • hey, thanks for the edit! Can you explain why the call to .result() is not necessary in this case? I don't know how exactly this is done automatically. – Raven Dec 09 '19 at 13:53
  • 1
    When you include a future as an argument to a submit call Dask identifies it as a data dependency. It waits until that future has finished computing before it runs the new task, and passes in the computed result, rather than the future. You can learn more about Dask futures at https://docs.dask.org/en/latest/futures.html – MRocklin Dec 09 '19 at 16:04
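The chaining behaviour described in that comment can be sketched with the standard library's `concurrent.futures` (the task and `log` names here are hypothetical, for illustration): each task waits on the previous future before doing its own work, so the tasks complete strictly in submission order even on a multi-threaded pool. The one difference from Dask is that Dask does this wait for you and hands the task the computed result; here the wait is explicit.

```python
import concurrent.futures

def task(name, log, prev=None):
    # Wait for the previous future before doing our own work,
    # mimicking how Dask treats a future argument as a dependency.
    if prev is not None:
        prev.result()
    log.append(name)
    return name

executor = concurrent.futures.ThreadPoolExecutor(max_workers=3)
log = []
f = executor.submit(task, "result1.txt", log)
f = executor.submit(task, "result2.txt", log, prev=f)
f = executor.submit(task, "result3.txt", log, prev=f)
f.result()          # block until the whole chain is done
executor.shutdown()
print(log)          # tasks finished in submission order
```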