
I'm using the Dask distributed scheduler, running a scheduler and 5 workers locally. I submit a list of delayed() tasks to compute().
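
Roughly, the setup looks like this (the sleep and the commented-out DB insert stand in for my real task body):

    import time
    from dask import delayed, compute
    from dask.distributed import Client

    client = Client(n_workers=5)       # local scheduler with 5 workers

    @delayed
    def task(i):
        time.sleep(15)                 # each task takes at least ~15 s
        # ... insert a row into the SQL db here (unique key per task) ...
        return i

    results = compute(*[task(i) for i in range(20)])   # 20 tasks >> 5 workers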

When the number of tasks is, say, 20 (much larger than the number of workers) and each task takes at least 15 seconds, the scheduler starts rerunning some of the tasks (or executing them in parallel more than once).

This is a problem because the tasks modify a SQL database, and if they run again they raise an exception due to uniqueness constraints. I'm not setting pure=True anywhere (and I believe the default is False). Other than that, the Dask graph is trivial (no dependencies between the tasks).

I'm still not sure whether this is a feature or a bug in Dask. I have a gut feeling that it might be related to work stealing...

Daniel

1 Answer


Correct: if tasks are allocated to one worker and another worker becomes free, the free worker may choose to steal excess tasks from its peers. There is a chance that it will steal a task that has just started to run, in which case the task will run twice.
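
(As an aside, more recent versions of distributed let you switch work stealing off entirely through the configuration system; a minimal sketch, assuming the distributed.scheduler.work-stealing setting is available in your version:)

    import dask
    from dask.distributed import Client

    # Must be set before the scheduler starts.
    dask.config.set({"distributed.scheduler.work-stealing": False})

    client = Client(n_workers=5)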

The clean way to handle this problem is to ensure that your tasks are idempotent, that they return the same result even if run twice. This might mean handling your database error within your task.
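
For example, a task that inserts into the database could treat a duplicate-key error as "already done" instead of failing; a rough sketch with SQLAlchemy (the session and record objects are placeholders for your own schema):

    from sqlalchemy.exc import IntegrityError

    def insert_record(session, record):
        """Insert a record, treating a duplicate-key violation as success."""
        try:
            session.add(record)
            session.commit()
        except IntegrityError:
            # Another run of the same task already inserted this row.
            session.rollback()
        return record

This keeps the task idempotent: running it a second time leaves the database unchanged and returns the same result.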

This is one of those policies that are great for data-intensive computing workloads but terrible for data-engineering workloads. It's tricky to design a system that satisfies both needs simultaneously.

MRocklin
  • Thanks. At least now I can stop trying to debug this. So it's a feature. Maybe you can update the online documentation to highlight that this is a possibility? Also, I'm not sure what the point of the "pure" argument is if a task can run multiple times either way. – Daniel Jan 31 '17 at 20:18
  • I also raised a github issue asking for a way to switch off work stealing for tasks that should run only once: https://github.com/dask/distributed/issues/847 – Arco Bast Feb 01 '17 at 07:18
  • If this happens and a task is run twice, for example, which task result is used in Dask's DAG? Is it just a straight-up race? – medley56 Dec 17 '19 at 16:58
  • It's not a race, but it is a random-ish choice. – MRocklin Dec 18 '19 at 17:08