I'm new to Dask and recently made my foray into parallel computing with this wonderful package. In my implementation, however, I've been struggling to understand why it takes 6 minutes to scatter a Python dict from my scheduler workstation's memory to my workers.
The dict is not huge: sys.getsizeof(mydict) reports 41943152 bytes (about 42 MB). Would it make a difference if I used a Dask or NumPy array instead? I'm fairly sure it's not a network constraint, since I was able to copy a 400 MB file to the worker machine in under 15 seconds.
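One thing I'm not sure about: sys.getsizeof on a dict only counts the hash table itself, not the keys and values inside it, so the real payload could be larger than 42 MB. This is how I'd estimate what scatter actually has to ship (a minimal sketch, assuming the dict pickles cleanly):

import pickle

d = my_vc.e1.dict_of_all_sea_mesh_edges

# length of the serialized payload, i.e. roughly what has to cross the network
payload_mb = len(pickle.dumps(d, protocol=pickle.HIGHEST_PROTOCOL)) / 1e6
print(f"pickled size: {payload_mb:.1f} MB")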
My setup is one dedicated worker workstation (2 processes x 1 thread each), and the scheduler workstation is also set up as a worker (4 processes x 1 thread each). Any help would be appreciated!
# `c` is a dask.distributed.Client connected to the scheduler
future_dict = my_vc.e1.dict_of_all_sea_mesh_edges
[future_dict] = c.scatter([future_dict])  # list-wrapped so the dict is shipped as a single object
Log:
Scattering dict_of_all_sea_mesh_edges to cluster execution started
Scattering dict_of_all_sea_mesh_edges to cluster completed in 00 HOURS :06 MINUTES :46.67 SECONDS
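To work out whether those six-plus minutes are spent serializing rather than transferring, I plan to time the pickling step locally (a sketch, assuming the dict is picklable):

import pickle
import time

d = my_vc.e1.dict_of_all_sea_mesh_edges

start = time.perf_counter()
payload = pickle.dumps(d, protocol=pickle.HIGHEST_PROTOCOL)
elapsed = time.perf_counter() - start

# if this alone takes minutes, scatter is serialization-bound, not network-bound
print(f"pickled in {elapsed:.2f} s")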
In[2]: sys.getsizeof(my_vc.e1.dict_of_all_sea_mesh_edges)/1000000
Out[2]: 41.943152
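In case it's relevant, these are the scatter options I'm considering trying next; I haven't verified that any of them help here (a sketch using the same client `c` as above):

d = my_vc.e1.dict_of_all_sea_mesh_edges

# direct=True sends data straight to the workers instead of routing it via the scheduler;
# hash=False skips hashing the large payload when naming the future;
# broadcast=True would instead push a copy to every worker up front
[future_dict] = c.scatter([d], direct=True, hash=False)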