I have a largish object (150 MB) that I need to broadcast to all dask distributed workers so it can be used in future tasks. I've tried a couple of approaches:
1. `Client.scatter(broadcast=True)`: This requires sending all the data from one machine (where I'm running the client and the scheduler), which creates a bandwidth bottleneck at that machine.
2. `Client.submit` followed by `Client.replicate`: The workers share a filesystem, so rather than sending the data, I can schedule a task that loads it on one worker and then replicate the result to all the others. This seems to use a tree strategy to distribute the data, which is faster than the first option.
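Roughly, the two attempts look like this; `load_data`, the scheduler address, and the file path are stand-ins for my actual setup:

```python
from distributed import Client
import numpy as np

client = Client("tcp://scheduler:8786")  # address illustrative

def load_data():
    # Stand-in for the real loader; reads from the shared filesystem.
    return np.load("/shared/big_object.npy")

# Attempt 1: load on the client machine, then scatter a copy to every
# worker. All 150 MB leave this one machine, so its uplink is the bottleneck.
data = load_data()
future = client.scatter(data, broadcast=True)

# Attempt 2: load the data on a single worker, then replicate the result
# worker-to-worker (tree strategy), which is faster than attempt 1.
future = client.submit(load_data)
client.replicate([future])
```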
However, it could be even faster to have every worker run the load function locally, rather than loading the data on one worker and serializing it from worker to worker. Is there a way to do this? `Client.run` seems like part of what I want, but I need to get back a future for the loaded data that I can pass to other tasks later.
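As a minimal sketch of the problem, `Client.run` does execute the loader on every worker, but it ships the results straight back to the client rather than leaving them on the workers behind futures:

```python
# Executes load_data() once on each worker, but returns the actual
# results to the client as a dict keyed by worker address; there is
# no Future here that I can pass to later client.submit() calls.
results = client.run(load_data)
# e.g. {'tcp://10.0.0.1:40123': <array>, 'tcp://10.0.0.2:40123': <array>}
```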