
dask.distributed keeps data in memory on workers until that data is no longer needed. (Thanks @MRocklin!)

While this process is efficient in terms of network usage, it still results in frequent pickling and unpickling of data. I assume this is not zero-copy pickling, unlike the memory-mapping that plasma or joblib provide for numpy arrays.
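(For reference: classic pickling does copy the payload, but since Python 3.8 pickle protocol 5 supports out-of-band buffers that keep large payloads out of the pickle stream. A minimal stdlib sketch, independent of dask:)

```python
import pickle

payload = bytearray(b"A" * 10_000)

# In-band (classic) pickling copies the whole payload into the stream.
inband = pickle.dumps(payload, protocol=5)
assert len(inband) > 10_000

# Out-of-band: wrap the data in a PickleBuffer; the stream then holds
# only metadata, and the raw buffer is handed to buffer_callback
# without being copied into the stream.
buffers = []
oob = pickle.dumps(pickle.PickleBuffer(payload), protocol=5,
                   buffer_callback=buffers.append)
assert len(oob) < 100

# The receiver supplies the buffers back to reconstruct the value.
restored = pickle.loads(oob, buffers=buffers)
assert bytes(restored) == bytes(payload)
```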

It's clear that when a calculation result is needed on another host or another worker, it has to be pickled and sent over. This can be avoided by relying on threaded parallelism within a single host (`--nthreads=8`), since all calculations access the same memory. Dask uses this trick and simply accesses the result of a calculation instead of pickling it.
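(The shared-memory effect can be seen with the standard library's thread pool — a toy sketch, not dask code: the worker thread receives the very same object, never a deserialized copy.)

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000_000))  # a large in-process object

def identity(obj):
    # Runs in a worker thread of the same process:
    # `obj` is the same object, never serialized.
    return obj

with ThreadPoolExecutor(max_workers=8) as pool:
    result = pool.submit(identity, data).result()

assert result is data  # no pickling, no copy
```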

When we're using process-based parallelization instead of threading on a single host (`--nprocs=8 --nthreads=1`), I'd expect that dask pickles and unpickles the data again to send it to the other worker. Is this correct?
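(The round trip I have in mind can be sketched with the stdlib — the actual wire format dask uses may differ:)

```python
import pickle

data = {"partial_result": [1, 2, 3]}

wire = pickle.dumps(data)   # what would travel over the local socket
copy = pickle.loads(wire)   # the receiving worker's view of the data

assert copy == data         # same value...
assert copy is not data     # ...but a fresh allocation, not shared memory
```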

Mark Horvath
  • Did you notice the costs of these steps? SER/DES (pickle) costs are awfully high for any ultra-low-latency processing. The threading tricks (if you compare them to process-based methods) ignore the cost of waiting for the monopolistic GIL lock (still, in 2020/2H), which re-[SERIAL]-ises any number of threads back into plain one-thread-(with GIL)-after-another-thread-(with GIL) execution while all the others keep waiting... **Definitely** not a sort of [tag:HPC] performance, is it? – user3666197 Jul 03 '20 at 23:57
  • Sure, it's my aim to avoid pickling and allocating memory again when unpickling, but then horses for courses. `dask` is great at identifying common sub-trees of calculations and can save me a lot of calculation time through optimal execution. Once data is sent over to another worker it should be quite fast. I'm comparing it to tools like `ray`, which for example passes results through plasma (and offers memmapping with zero-copy reads), but is probably not able to recognize common parts of larger calculation trees, so it will actually calculate more. – Mark Horvath Jul 04 '20 at 10:43
  • There is an open discussion about using shared memory if two workers are on the same host: https://github.com/dask/dask/issues/6267 – Mark Horvath Jul 04 '20 at 11:14

1 Answer


> In case the same process needs the calculation result (continuing based on the temporary result), is it smart enough to keep a cache in the given process and reuse results there?

Yes. Dask keeps data in memory on workers until that data is no longer needed. https://distributed.dask.org/en/latest/memory.html

Edit: When Dask moves data between processes on the same host, then yes: it serializes the data, moves it across a local socket, and deserializes it on the other side. It doesn't necessarily use pickle to serialize; that depends on the type.
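(A toy illustration of type-dependent serialization — not Dask's actual implementation, whose real machinery lives in `distributed.protocol` — is a registry that picks a codec per type:)

```python
import json
import pickle

def serialize(obj):
    """Pick a codec based on the object's type (toy example)."""
    if isinstance(obj, (bytes, bytearray)):
        return b"raw", bytes(obj)               # already bytes: send as-is
    if isinstance(obj, (dict, list, str, int, float)):
        return b"json", json.dumps(obj).encode()
    return b"pickle", pickle.dumps(obj)         # generic fallback

def deserialize(tag, payload):
    """Reverse the codec chosen by serialize()."""
    if tag == b"raw":
        return payload
    if tag == b"json":
        return json.loads(payload)
    return pickle.loads(payload)

tag, wire = serialize({"a": 1})
assert tag == b"json"
assert deserialize(tag, wire) == {"a": 1}
```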

MRocklin