
I have a largish object (150 MB) that I need to broadcast to all dask distributed workers so it can be used in future tasks. I've tried a couple of approaches:

  • Client.scatter(broadcast=True): This requires sending all the data from one machine (where I'm running the client and the scheduler), which creates a bandwidth bottleneck.
  • Client.submit followed by Client.replicate: The workers share a filesystem, so rather than send the data, I can schedule a task that loads the data on one worker and then replicate the result to all workers (roughly sketched at the end of this question). This seems to use a tree strategy to distribute the data, which is faster than the previous option.

However, it is potentially faster to force every worker to run the load function locally, rather than load the data on one worker and then serialize it from worker to worker. Is there a way to do this? Client.run seems like part of what I want, but I need to get back a future for the loaded data that I can pass to other tasks later.
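
For reference, the second approach looks roughly like this (the scheduler address, load_data, and the file path below are placeholders for my actual setup):

    from distributed import Client
    import pickle

    client = Client("scheduler-address:8786")  # placeholder scheduler address

    def load_data(path):
        # Placeholder loader; every worker can read `path` from the shared filesystem.
        with open(path, "rb") as f:
            return pickle.load(f)

    # Run the load once on a single worker ...
    future = client.submit(load_data, "/shared/big_object.pkl")

    # ... then copy the result to every worker (tree-style replication).
    client.replicate([future])

    # `future` can now be passed to later tasks without re-sending the object.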

2 Answers


The short answer here is "no": there is no straightforward way to accomplish this. One could hack something together, though, if you are comfortable using internal code (which may change without warning).
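
The answer doesn't say which internals it has in mind; purely as my own sketch of one such hack, one could use Client.run to execute the load on every worker and stash the result in each worker's internal data store (worker.data is not public API, and load_data, the key name, and the paths are placeholders):

    from distributed import Client, get_worker
    import pickle

    client = Client("scheduler-address:8786")   # placeholder scheduler address
    DATA_KEY = "big-shared-object"              # arbitrary key chosen for this hack

    def load_data(path):
        with open(path, "rb") as f:
            return pickle.load(f)

    def load_on_worker(path, dask_worker):
        # Client.run passes the Worker object to a `dask_worker` argument;
        # writing into worker.data directly is the "internal code" part.
        dask_worker.data[DATA_KEY] = load_data(path)

    # Runs outside the task graph, once per worker, so every worker loads locally.
    client.run(load_on_worker, "/shared/big_object.pkl")

    def use_object(x):
        # Later tasks look the object up locally instead of receiving a future.
        obj = get_worker().data[DATA_KEY]
        return x  # placeholder for real work that uses `obj`

    result = client.submit(use_object, 42).result()

Note that this sidesteps the question's requirement of getting a future back: tasks have to look the object up themselves, and workers that join after Client.run won't have the key.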

Another way would be to fold the loading computation into how the object is serialized: serialize only a small recipe for the object, and have the deserialization code call the load function again on the receiving worker.
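
A minimal sketch of that idea, assuming a hypothetical load_data() that reads the object from the shared filesystem: the wrapper pickles as just a path, and unpickling re-runs the load on whichever worker receives it, so copying the object between workers only ever moves the small recipe:

    import pickle

    def load_data(path):
        # Hypothetical loader; every worker can read `path` from the shared filesystem.
        with open(path, "rb") as f:
            return pickle.load(f)

    class LazyLoaded:
        """Serializes as a recipe (the path) rather than as the large payload."""

        def __init__(self, path):
            self.path = path
            self.value = load_data(path)  # the actual large object

        def __reduce__(self):
            # Pickling ships only (class, path); unpickling calls LazyLoaded(path),
            # which runs load_data() again on the receiving worker.
            return (LazyLoaded, (self.path,))

Tasks would then be handed the wrapper (for example via client.submit(LazyLoaded, path) plus Client.replicate) and read wrapper.value; that dask falls back to pickle for an object like this, rather than a specialized serializer, is an assumption worth verifying.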

MRocklin

I had the exact same problem, which I asked about on StackOverflow and recently solved; see this for my solution.

user8871302