
First, read this question: Repeated task execution using the distributed Dask scheduler

Now, when Dask decides to rerun a task due to work stealing or a task failing (for example, as a result of per-process memory limits), which task's result gets passed to the next node of the DAG? We are using nested tasks, e.g.

import dask

@dask.delayed
def add(n):
    return n + 1

t_a = add(1)
t_b = add(t_a)
the_output = add(add(add(t_b)))

So if one of these tasks fails, or gets stolen, and is run twice, which result gets passed to the next node in the DAG?

Further background for those interested: The reason this has come up is that our task writes to a database. If it runs twice, we get an integrity error because it is trying to insert the same record twice (constrained on id and version in combination). The current plan is to make the task idempotent by catching the integrity error in the task but I still don't understand how Dask "chooses" a result.
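For concreteness, here is a rough sketch of that idempotent-write plan (using the standard-library sqlite3 module; the results table, its UNIQUE(id, version) constraint, and the column names are simplified placeholders, not our real schema):

import sqlite3

import dask

@dask.delayed
def write_record(record_id, version, payload):
    # Idempotent insert: if the task is rerun, the second INSERT violates
    # the (id, version) uniqueness constraint; swallow that instead of failing.
    conn = sqlite3.connect("results.db")
    try:
        conn.execute(
            "INSERT INTO results (id, version, payload) VALUES (?, ?, ?)",
            (record_id, version, payload),
        )
        conn.commit()
    except sqlite3.IntegrityError:
        pass  # the record was already written by an earlier run of this task
    finally:
        conn.close()
    return record_id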

medley56

1 Answer


If you have a situation like add(add(add(t_b)))

Or more generally

x = add(1)
y = add(x)
z = add(y)

Even though those all use the same function, they are all separate tasks. Dask sees that they have different inputs and so it treats them differently.
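You can see this directly by inspecting the keys of the Delayed objects (a quick sketch; the hashes shown in the comments are illustrative):

import dask

@dask.delayed
def add(n):
    return n + 1

x = add(1)
y = add(x)
z = add(y)

# Same function, three separate tasks: each Delayed carries its own
# key in the task graph because its inputs differ.
print(x.key)  # e.g. 'add-24d0f0f2...'
print(y.key)  # a different hash
print(z.key)  # a different hash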

So if one of these tasks fails, or gets stolen, and is run twice, which result gets passed to the next node in the DAG?

In all of these cases, there is only one valid result on the cluster at any time. A stolen task is only run on the new machine, not the old one. If the result of a task is lost and has to be rerun, then only the new value will be present anywhere (the old value was lost, remember).
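As an aside, if you want the scheduler to rerun a failing task automatically, the retries= keyword on submit/compute does that (a minimal sketch on a local cluster; flaky_add is just a stand-in for your real task):

from dask.distributed import Client

def flaky_add(n):
    return n + 1

client = Client()  # local cluster, just for illustration
future = client.submit(flaky_add, 1, retries=2)  # rerun up to twice on failure
print(future.result())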

MRocklin