I'm new to Dask and recently made my foray into parallel computing with this wonderful package. In my implementation, however, I've been struggling to understand why it takes 6 minutes to scatter a Python dict from my scheduler workstation's memory to my workers.
The dict is not huge: sys.getsizeof(mydict) reports 41943152 bytes (about 42 MB). Would it make a difference if I used a Dask or NumPy array instead? I'm fairly sure it's not a network constraint, since I was able to copy a 400 MB file to the worker machine in under 15 seconds.
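One thing I'm not sure about: sys.getsizeof on a dict only counts the hash table itself, not the keys and values inside it, so the real payload could be larger than 42 MB. This is how I'd estimate what scatter actually has to ship (a minimal sketch, assuming the dict pickles cleanly):

import pickle

d = my_vc.e1.dict_of_all_sea_mesh_edges

# length of the serialized payload, i.e. roughly what has to cross the network
payload_mb = len(pickle.dumps(d, protocol=pickle.HIGHEST_PROTOCOL)) / 1e6
print(f"pickled size: {payload_mb:.1f} MB")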
My setup is one dedicated worker workstation (2 processes x 1 thread each), and the scheduler workstation is also set up as a worker (4 processes x 1 thread each). Any help would be appreciated!
# `c` is a dask.distributed.Client connected to the scheduler
future_dict = my_vc.e1.dict_of_all_sea_mesh_edges
[future_dict] = c.scatter([future_dict])  # list-wrapped so the dict is shipped as a single object
Log:
Scattering dict_of_all_sea_mesh_edges to cluster execution started
Scattering dict_of_all_sea_mesh_edges to cluster completed in 00 HOURS :06 MINUTES :46.67 SECONDS
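To work out whether those six-plus minutes are spent serializing rather than transferring, I plan to time the pickling step locally (a sketch, assuming the dict is picklable):

import pickle
import time

d = my_vc.e1.dict_of_all_sea_mesh_edges

start = time.perf_counter()
payload = pickle.dumps(d, protocol=pickle.HIGHEST_PROTOCOL)
elapsed = time.perf_counter() - start

# if this alone takes minutes, scatter is serialization-bound, not network-bound
print(f"pickled in {elapsed:.2f} s")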
In[2]: sys.getsizeof(my_vc.e1.dict_of_all_sea_mesh_edges)/1000000
Out[2]: 41.943152
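In case it's relevant, these are the scatter options I'm considering trying next; I haven't verified that any of them help here (a sketch using the same client `c` as above):

d = my_vc.e1.dict_of_all_sea_mesh_edges

# direct=True sends data straight to the workers instead of routing it via the scheduler;
# hash=False skips hashing the large payload when naming the future;
# broadcast=True would instead push a copy to every worker up front
[future_dict] = c.scatter([d], direct=True, hash=False)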