4

This is a follow-up to this question. I'm now trying to run Dask on multiple EC2 nodes on AWS.

I'm able to start up the scheduler on the first machine:

[screenshot: dask-scheduler startup output]
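Roughly, the scheduler is started with the stock `dask-scheduler` CLI (the exact flags may have differed; the port matches the address the workers register to below):

```
dask-scheduler --port 31001
```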

I then start up workers on several other machines. From those machines I can reach the scheduler with `nc -zv ${HOST} ${PORT}`, and the workers otherwise appear to connect to the scheduler, as evidenced by the worker's stdout: `Registered to: tcp://10.201.101.108:31001`. Almost immediately, though, the worker starts logging timeout warnings from its event loop.

[screenshot: worker log showing the timeout warnings]
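The workers on the other machines are started by pointing the stock `dask-worker` CLI at that scheduler address, roughly like this (exact worker flags may differ):

```
dask-worker tcp://10.201.101.108:31001
```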

From the master node, in my Jupyter notebook I then connect to the scheduler:

dask_client = Client('10.201.101.108:31001')

But the work does not propagate to the worker nodes (worker-node CPU stays below 1%), nor even to the worker running on the same machine as the scheduler. The task is highly parallelizable: when run on a single machine (i.e., with `Client(processes=False)`) it consumes every core.
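For comparison, here is a minimal sketch of what I run in the notebook; `work` is a stand-in for the real task, not the actual function:

```python
from dask.distributed import Client

# Single-machine run (saturates every core, as described above):
# dask_client = Client(processes=False)

# Distributed run: connect to the scheduler started above.
dask_client = Client('10.201.101.108:31001')

def work(i):
    # Stand-in for the real, highly parallel task.
    return i * i

futures = dask_client.map(work, range(1000))
results = dask_client.gather(futures)  # worker CPUs stay below 1% while this runs
```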

user554481

1 Answer

1

It is not uncommon to see the "Event loop was unresponsive" warning when first connecting, depending on your network.

Some things to check (see the combined sketch after this list):

  1. `client.get_versions(check=True)`
  2. Does `client.scheduler_info()['workers']` have anything? If not, then you might have some trouble connecting
  3. Consider looking at the worker logs with `client.get_worker_logs()`
  4. Try running a simple computation like `client.submit(lambda x: x + 1, 10).result()`
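Run in order, those checks look roughly like this (the scheduler address is taken from the question; adapt as needed):

```python
from dask.distributed import Client

client = Client('10.201.101.108:31001')

# 1. Compare scheduler, worker, and client package versions; check=True raises on mismatch.
print(client.get_versions(check=True))

# 2. An empty dict here means no workers have registered with the scheduler.
print(client.scheduler_info()['workers'])

# 3. The worker logs usually say why a worker disconnected or timed out.
print(client.get_worker_logs())

# 4. A trivial round-trip computation; should print 11.
print(client.submit(lambda x: x + 1, 10).result())
```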
MRocklin
  • `client.get_versions(check=True)` shows all of the nodes and their software versions, but both `client.scheduler_info()['workers']` and `client.get_worker_logs()` hang indefinitely. That makes it seem like it's a network connectivity issue, but if true how would the scheduler have been able to return results from `client.get_versions(check=True)` if it were not able to connect to the worker nodes? – user554481 Jan 04 '18 at 16:41
  • Actually, I take that back: `client.scheduler_info()['workers']` and `client.get_worker_logs()` aren't hanging. They're able to return results about all of the worker nodes quickly and without problems – user554481 Jan 04 '18 at 16:55
  • I've now also tried on my personal Mac (work Mac might have been more locked down) and the issue persists. – user554481 Jan 21 '18 at 19:00