
I have a largish object (150 MB) that I need to broadcast to all dask distributed workers so it can be used in future tasks. I've tried a couple of approaches:

  • Client.scatter(broadcast=True): This requires sending all the data from one machine (where I'm running the client and the scheduler), which creates a bandwidth bottleneck.
  • Client.submit followed by Client.replicate: The workers share a filesystem, so rather than send the data, I can schedule a task that loads the data on one worker and then replicate the result to all workers (roughly sketched at the end of this question). This seems to use a tree strategy to distribute the data, which is faster than the previous option.

However, it is potentially faster to force every worker to run the load function locally, rather than load the data on one worker and then serialize it from worker to worker. Is there a way to do this? Client.run seems like part of what I want, but I need to get back a future for the loaded data that I can pass to other tasks later.
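
For reference, the second approach looks roughly like this (the scheduler address, load_data, and the file path below are placeholders for my actual setup):

    from distributed import Client
    import pickle

    client = Client("scheduler-address:8786")  # placeholder scheduler address

    def load_data(path):
        # Placeholder loader; every worker can read `path` from the shared filesystem.
        with open(path, "rb") as f:
            return pickle.load(f)

    # Run the load once on a single worker ...
    future = client.submit(load_data, "/shared/big_object.pkl")

    # ... then copy the result to every worker (tree-style replication).
    client.replicate([future])

    # `future` can now be passed to later tasks without re-sending the object.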

2 Answers


The short answer here is "no": there is no straightforward way to accomplish this. One could hack something together, though, if you are comfortable using internal code (which may change without warning).
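
The answer doesn't say which internals it has in mind; purely as my own sketch of one such hack, one could use Client.run to execute the load on every worker and stash the result in each worker's internal data store (worker.data is not public API, and load_data, the key name, and the paths are placeholders):

    from distributed import Client, get_worker
    import pickle

    client = Client("scheduler-address:8786")   # placeholder scheduler address
    DATA_KEY = "big-shared-object"              # arbitrary key chosen for this hack

    def load_data(path):
        with open(path, "rb") as f:
            return pickle.load(f)

    def load_on_worker(path, dask_worker):
        # Client.run passes the Worker object to a `dask_worker` argument;
        # writing into worker.data directly is the "internal code" part.
        dask_worker.data[DATA_KEY] = load_data(path)

    # Runs outside the task graph, once per worker, so every worker loads locally.
    client.run(load_on_worker, "/shared/big_object.pkl")

    def use_object(x):
        # Later tasks look the object up locally instead of receiving a future.
        obj = get_worker().data[DATA_KEY]
        return x  # placeholder for real work that uses `obj`

    result = client.submit(use_object, 42).result()

Note that this sidesteps the question's requirement of getting a future back: tasks have to look the object up themselves, and workers that join after Client.run won't have the key.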

Another way would be to fold the loading computation into how the object is serialized: serialize only a small recipe for the object, and have the deserialization code call the load function again on the receiving worker.
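
A minimal sketch of that idea, assuming a hypothetical load_data() that reads the object from the shared filesystem: the wrapper pickles as just a path, and unpickling re-runs the load on whichever worker receives it, so copying the object between workers only ever moves the small recipe:

    import pickle

    def load_data(path):
        # Hypothetical loader; every worker can read `path` from the shared filesystem.
        with open(path, "rb") as f:
            return pickle.load(f)

    class LazyLoaded:
        """Serializes as a recipe (the path) rather than as the large payload."""

        def __init__(self, path):
            self.path = path
            self.value = load_data(path)  # the actual large object

        def __reduce__(self):
            # Pickling ships only (class, path); unpickling calls LazyLoaded(path),
            # which runs load_data() again on the receiving worker.
            return (LazyLoaded, (self.path,))

Tasks would then be handed the wrapper (for example via client.submit(LazyLoaded, path) plus Client.replicate) and read wrapper.value; that dask falls back to pickle for an object like this, rather than a specialized serializer, is an assumption worth verifying.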

MRocklin

I had the exact same problem, which I asked about on StackOverflow and recently solved; see this for my solution.

user8871302