This is a follow-up to this question. I'm now trying to run Dask on multiple EC2 nodes on AWS.
I'm able to start up the scheduler on the first machine:
I then start up workers on several other machines. From the other machines I'm able to reach the scheduler using nc -zv ${HOST} ${PORT}, and the workers otherwise seem to be able to connect to the master, as evidenced by the worker's stdout: Registered to: tcp://10.201.101.108:31001. Almost immediately afterwards, however, the worker starts complaining about timeouts in a loop.
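For reference, the nc -zv reachability check can be approximated from Python with only the standard library. This is a minimal sketch: the helper name port_is_open and the 3-second timeout are my own choices for illustration, not part of Dask. It only tests that a TCP connection can be opened in one direction, which says nothing about whether the scheduler can connect back to the worker.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened.

    Hypothetical helper, roughly equivalent to `nc -zv HOST PORT`.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False
```

Running port_is_open("10.201.101.108", 31001) on each worker machine checks the worker-to-scheduler direction; running it on the scheduler machine against a worker's address and port checks the reverse.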
From the master node, in my Jupyter notebook I then connect to the scheduler:
dask_client = Client('10.201.101.108:31001')
But the work does not propagate to the worker nodes (worker-node CPU stays at <1%), or even to the worker running on the same machine as the scheduler. This is a highly parallelized task, and when run on a single machine (i.e., using Client(processes=False)) it consumes every core on the machine.
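One way to narrow this down from the notebook is to ask the scheduler which workers it currently knows about and then submit a trivial task. The following is a rough sketch under assumptions: summarize_workers and check_cluster are hypothetical helper names, the timeouts are arbitrary, and the exact keys in the dictionary returned by Client.scheduler_info() may differ between distributed versions (older releases report ncores rather than nthreads).

```python
def summarize_workers(scheduler_info):
    """Return sorted (address, thread_count) pairs from a scheduler_info() dict.

    Hypothetical helper; falls back to the older "ncores" key if "nthreads"
    is absent.
    """
    workers = scheduler_info.get("workers", {})
    return sorted(
        (addr, w.get("nthreads", w.get("ncores")))
        for addr, w in workers.items()
    )


def check_cluster(address="10.201.101.108:31001"):
    """Print the workers the scheduler reports, then run one trivial task."""
    # Imported lazily so the helper above can be used without Dask installed.
    from dask.distributed import Client

    client = Client(address, timeout=10)
    try:
        # Newer releases may only report a subset of workers by default.
        for addr, nthreads in summarize_workers(client.scheduler_info()):
            print(addr, nthreads)
        # If this hangs or times out, tasks are not reaching any worker.
        print(client.submit(sum, [1, 2, 3]).result(timeout=30))
    finally:
        client.close()
```

If the worker list is empty, or the trivial task never completes, the problem is likely in worker registration or networking rather than in the task itself.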