I would like to distribute a large object (or load it from disk) when a worker starts, and put it into a global variable (such as calib_data). Does that work with dask workers?

1 Answer
It seems the client method register_worker_callbacks can do what you want in this case. You will still need somewhere to put your variable, since in Python there is no truly global scope. That somewhere could, for example, be an attribute of an imported module, which any worker would then have access to. You could also add it as an attribute of the worker instance itself, but I see no obvious reason to want to do that.
One way that works is to hijack a randomly picked builtin module, although I do not particularly recommend it (see below):
def attach_var(name, value):
    # stash the value as an attribute of an (unrelated) imported module
    import re
    setattr(re, name, value)

client.run(attach_var, 'x', 1)

def use_var():
    # any function running on a worker can do this, via delayed or
    # whatever method you pass with
    import re
    return re.x

client.run(use_var)
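To have that attach step happen automatically whenever a worker starts (including workers that join later), you can combine the same stash idea with register_worker_callbacks. A minimal sketch, where load_calib_data() is a hypothetical stand-in for however you actually read the data from disk:

from distributed import Client

client = Client()  # or Client('scheduler-address:8786')

def load_calib_data():
    # placeholder for the real loading code
    return {'gain': 1.2, 'offset': 0.3}

def setup_worker():
    # runs once on every current and future worker
    import re  # any importable module can serve as the stash
    re.calib_data = load_calib_data()

client.register_worker_callbacks(setup_worker)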
Before going ahead, though, have you already considered delayed(calib_data) or scatter, which will copy your variable to wherever it's needed, e.g.,
futures = client.scatter(calib_data, broadcast=True)
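The returned future can then be passed directly into tasks, and dask substitutes the underlying data on the worker. A sketch, reusing the process_stuff and dataset1 placeholders from the delayed example below:

result = client.submit(process_stuff, dataset1, futures).result()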
Or, indeed, you could load the data on the workers using ordinary delayed semantics:
dcalib = dask.delayed(load_calib_data)()
work = dask.delayed(process_stuff)(dataset1, dcalib)
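Nothing runs until the graph is computed, for example:

result = work.compute()  # or: (result,) = dask.compute(work)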

-
Thank you for the detailed description. I come from the ipyparallel world, where you could just say workers['x'] = 5 and then there would be a global `x` which is 5 on all workers. I currently have done this: ```def init_workers(atom_data): globals()['ATOM_DATA'] = atom_data``` and then call client.run(init_workers, atom_data) – maybe that is not what we want. What do you think, @mdurant? – Wolfgang Kerzendorf Jan 31 '19 at 17:24
-
I'm not certain you want to copy ipyparallel's approach; the delayed/futures way is much more explicit. Also, I'm not certain that attaching to globals will work (you can try), since any function you pass will see a different set of globals. I would attach to a module. – mdurant Jan 31 '19 at 18:15
-
Seemingly, I don't quite understand dask's purpose and how it differs from ipyparallel, joblib, and the rest. I have 40k parameter sets that I need to put through a Python/Cython/C hybrid that uses OpenMP. I want each worker to run one task at a time on a slurm/torque/sge kind of setup. I'm happy to ask this as a proper Stack Overflow question. – Wolfgang Kerzendorf Jan 31 '19 at 20:09
-
I think you better had. I believe the answer above is sufficient for the question as posed. (Note that you can use Dask as a backend for joblib, including "scattering" a variable to all tasks) – mdurant Jan 31 '19 at 20:25
-
https://stackoverflow.com/questions/54469195/dask-joblib-ipyparallel-and-other-schedulers - let me know if this is clear or should be improved. – Wolfgang Kerzendorf Jan 31 '19 at 21:03