My tasks are returning with KilledWorker exceptions when using Dask with the dask.distributed scheduler. What do these errors mean?

MRocklin

1 Answer

This error is generated when the Dask scheduler no longer trusts your task, because it was present too often when workers died unexpectedly. It is designed to protect the cluster against tasks that kill workers, for example by segfaults or memory errors.

Whenever a worker dies unexpectedly, the scheduler notes which tasks were running on that worker when it died. It retries those tasks on other workers but also marks them as suspicious. If the same task is present on several workers when they die, the scheduler eventually gives up retrying it and instead marks it as failed with the exception KilledWorker.
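
The bookkeeping described above can be sketched as a toy model. This is not the scheduler's actual code: ToyScheduler, worker_died, and the local KilledWorker class are hypothetical stand-ins, and the rule that a task is abandoned once its count exceeds the limit is an assumption made for illustration.

```python
from collections import defaultdict


class KilledWorker(Exception):
    """Stand-in for distributed's KilledWorker; this toy class is not the real one."""


class ToyScheduler:
    """Hypothetical, heavily simplified model of the suspicion counter."""

    def __init__(self, allowed_failures=3):
        self.allowed_failures = allowed_failures
        self.suspicious = defaultdict(int)  # task key -> worker deaths witnessed
        self.failed = {}  # task key -> KilledWorker instance

    def worker_died(self, worker, running_tasks):
        """Record an unexpected worker death; return the tasks to retry elsewhere."""
        retry = []
        for key in running_tasks:
            if key in self.failed:
                continue  # already given up on this task
            self.suspicious[key] += 1
            # Assumption: a task is abandoned once its count exceeds the limit.
            if self.suspicious[key] > self.allowed_failures:
                self.failed[key] = KilledWorker(key, worker)
            else:
                retry.append(key)
        return retry


scheduler = ToyScheduler(allowed_failures=3)
# "bad-task" is on every worker that dies; "ok-task" is only caught up once.
scheduler.worker_died("worker-0", ["bad-task", "ok-task"])
for n in range(1, 4):
    scheduler.worker_died(f"worker-{n}", ["bad-task"])

print(sorted(scheduler.failed))
```

In this sketch only the task that keeps showing up on dying workers ends up marked as failed; a task that was merely unlucky once is retried and keeps a low count.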

Often this means that your task has some other issue. Perhaps it causes a segmentation fault or allocates too much memory. Perhaps it uses a library that is not threadsafe. Or perhaps it is just very unlucky. Regardless, you should inspect your worker logs to determine why your workers are failing. This is likely a bigger issue than your task failing.

You can control this behavior by modifying the following entry, found under the distributed: scheduler: section of your ~/.config/dask/distributed.yaml file.

allowed-failures: 3     # number of retries before a task is considered bad
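
If editing the YAML file is inconvenient, the same setting can probably be overridden from Python with dask.config.set. The sketch below assumes the dotted key distributed.scheduler.allowed-failures and uses an arbitrary value of 10. The override only matters in the process where the scheduler is started, so it would need to run before the cluster is created.

```python
import dask

# Key path assumed from the distributed: scheduler: section of distributed.yaml.
KEY = "distributed.scheduler.allowed-failures"

# Raise the limit for this process; the value 10 is arbitrary.
dask.config.set({KEY: 10})

# Read the value back to confirm the override took effect.
allowed = dask.config.get(KEY)
print(allowed)
```

Raising the limit only hides the symptom; the worker logs are still the place to find out why workers are dying.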
MRocklin
  • The logs I see through the tracking UI contain no helpful details, in my case, as to why the tasks are crashing. Are there additional logs to look for, or dump files to look at in a certain filesystem path? – matanster Aug 18 '18 at 19:20
  • As standard practice, dask workers log to stdout by default. You can redirect this output to a file when you set up your workers. – MRocklin May 14 '19 at 13:06
  • How can I change this parameter if I don't have the file `~/.dask/config.yaml`? – BND Jun 22 '19 at 08:54
  • This has moved to `~/.config/dask/distributed.yaml`. I've updated the answer. – MRocklin Jun 22 '19 at 10:35
  • This was very useful @MRocklin! One thing that would make it easier to understand is including a stack trace from a failed task in the stack trace on the client. `KilledWorker` doesn't make clear what the problem with the task is, or even that the problem is with the task... – Maximilian Jun 22 '21 at 18:41
  • @MRocklin so I got a KilledWorker error like this: `KilledWorker('task-8-b0923903c62c435cb3b750bdebdb8cc7', )`. Can you please explain what memory: 0 and processing: 5 mean? (thanks) – marwan Jun 07 '22 at 12:32
  • @Maximilian Take a look at `dask_client.forward_logging()` from 2023.6.0... – jtlz2 Jun 15 '23 at 19:05