
My Dask computation is slow. When I look at the status page of the diagnostics dashboard I see that most of the time is spent in disk-read-* and disk-write-* tasks.

What does this mean?

How do I diagnose this issue?

MRocklin
  • 55,641
  • 23
  • 163
  • 235

1 Answer


When Dask workers start to run out of memory, they spill excess data to disk. Each spill is recorded on the status page as a disk-write-* task. When that data is needed again, it is read back from disk, and a disk-read-* task appears on the status page. You can confirm this by looking at the upper-left plot, which shows memory use per worker, or by looking at the solid portion of the progress bars, which shows how many tasks of each type are still held in memory.
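The per-worker memory check can also be done programmatically. The following is a minimal sketch, assuming the dictionary returned by `Client.scheduler_info()` has a `workers` mapping whose entries carry `memory_limit` and `metrics["memory"]` (the exact layout may differ between versions of distributed). The helper name `memory_fractions` and the sample data are hypothetical, made up for illustration.

```python
def memory_fractions(info):
    """Return {worker_address: used_memory / memory_limit}.

    `info` is assumed to be shaped like the result of
    distributed.Client.scheduler_info(); this layout is an assumption.
    """
    fractions = {}
    for address, worker in info.get("workers", {}).items():
        limit = worker.get("memory_limit") or 0
        used = worker.get("metrics", {}).get("memory", 0)
        if limit:  # skip workers that have no memory limit configured
            fractions[address] = used / limit
    return fractions


# Hypothetical sample data standing in for client.scheduler_info()
sample_info = {
    "workers": {
        "tcp://127.0.0.1:40001": {
            "memory_limit": 4_000_000_000,
            "metrics": {"memory": 3_000_000_000},
        },
        "tcp://127.0.0.1:40002": {
            "memory_limit": 4_000_000_000,
            "metrics": {"memory": 1_000_000_000},
        },
    }
}

for address, fraction in memory_fractions(sample_info).items():
    print(f"{address}: {fraction:.0%} of memory limit in use")
```

With a live cluster you would pass `client.scheduler_info()` instead of the sample dictionary. Workers that sit close to their memory limit are the likely source of the disk-write-* tasks.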

Ways you can address this:

  1. Figure out why Dask needs to keep so much data in memory. Common causes:
    1. You persist a lot of data
    2. Dask has to keep a lot of intermediate results, as in a full shuffle or in computations whose results have high cardinality
  2. Get more memory
  3. Get a faster disk. Disk bandwidth has improved considerably in the last few years. Drives with 1-2 GB/s of bandwidth are now available even on consumer-grade personal laptops.
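To get a rough feel for how much disk speed matters, here is a back-of-the-envelope sketch. It assumes every spilled byte is written once and read back once, and it ignores serialization overhead. The function name, the 20 GB figure, and the 0.1 GB/s "slow disk" number are hypothetical, chosen only for illustration.

```python
def spill_round_trip_seconds(spilled_bytes, disk_bandwidth_bytes_per_s):
    """Estimate time spent writing data to disk once and reading it back once."""
    return 2 * spilled_bytes / disk_bandwidth_bytes_per_s


GB = 1e9
spilled = 20 * GB  # hypothetical amount of data spilled during a computation

# Hypothetical bandwidths: a slow disk versus a drive in the 1-2 GB/s range
for label, bandwidth in [("slow disk (0.1 GB/s)", 0.1 * GB),
                         ("fast SSD (1.5 GB/s)", 1.5 * GB)]:
    seconds = spill_round_trip_seconds(spilled, bandwidth)
    print(f"{label}: about {seconds:.0f} s spent on disk I/O")
```

If the estimate is a large share of your total runtime, reducing the amount of data held in memory is likely to help more than any hardware change.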
MRocklin
    To add, Dask may not always handle text data in the most memory-efficient manner. You may see performance improvements if you open the Dask config file `~/.dask/config.yaml` and change `worker-memory-target: 0.60` to `worker-memory-target: 1.00`. See http://distributed.readthedocs.io/en/latest/worker.html#memory-management for further details. – blahblahblah Feb 24 '18 at 00:02
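
To make the effect of that setting concrete, the sketch below converts a target fraction into a byte count. It assumes, following the memory-management documentation linked in the comment, that a worker starts spilling once the data it holds exceeds the target fraction of its memory limit. The function name and the 8 GB limit are hypothetical.

```python
def spill_threshold_bytes(memory_limit_bytes, target_fraction):
    """Return the amount of held data at which spilling to disk would begin."""
    if not 0 < target_fraction <= 1:
        raise ValueError("target_fraction must be in the interval (0, 1]")
    return round(memory_limit_bytes * target_fraction)


memory_limit = 8_000_000_000  # hypothetical 8 GB worker memory limit

for target in (0.60, 1.00):
    threshold = spill_threshold_bytes(memory_limit, target)
    print(f"target {target:.2f}: spilling starts near {threshold / 1e9:.1f} GB")
```

In more recent Dask releases the same threshold is, as far as I know, configured under the key `distributed.worker.memory.target`, so check the configuration reference for the version you run before editing the file.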