4

This is a follow-up to this question. I'm now trying to run Dask on multiple EC2 nodes on AWS.

I'm able to start up the scheduler on the first machine:

[screenshot: dask-scheduler startup output]
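Roughly, the scheduler is started with the stock `dask-scheduler` CLI (the exact flags may have differed; the port matches the address the workers register to below):

```
dask-scheduler --port 31001
```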

I then start up workers on several other machines. From those machines I can reach the scheduler with `nc -zv ${HOST} ${PORT}`, and the workers otherwise appear to connect to the scheduler, as evidenced by the worker's stdout: `Registered to: tcp://10.201.101.108:31001`. Almost immediately, though, the worker starts logging timeout warnings from its event loop.

[screenshot: worker log showing the timeout warnings]
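The workers on the other machines are started by pointing the stock `dask-worker` CLI at that scheduler address, roughly like this (exact worker flags may differ):

```
dask-worker tcp://10.201.101.108:31001
```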

From the master node, in my Jupyter notebook I then connect to the scheduler:

dask_client = Client('10.201.101.108:31001')

But the work does not propagate to the worker nodes (worker-node CPU stays below 1%), nor even to the worker running on the same machine as the scheduler. The task is highly parallelizable: when run on a single machine (i.e., with `Client(processes=False)`) it consumes every core.
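For comparison, here is a minimal sketch of what I run in the notebook; `work` is a stand-in for the real task, not the actual function:

```python
from dask.distributed import Client

# Single-machine run (saturates every core, as described above):
# dask_client = Client(processes=False)

# Distributed run: connect to the scheduler started above.
dask_client = Client('10.201.101.108:31001')

def work(i):
    # Stand-in for the real, highly parallel task.
    return i * i

futures = dask_client.map(work, range(1000))
results = dask_client.gather(futures)  # worker CPUs stay below 1% while this runs
```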

user554481

1 Answer

1

It is not uncommon to see the "Event loop was unresponsive" warning when first connecting, depending on your network.

Some things to check (see the combined sketch after this list):

  1. `client.get_versions(check=True)`
  2. Does `client.scheduler_info()['workers']` have anything? If not, then you might have some trouble connecting
  3. Consider looking at the worker logs with `client.get_worker_logs()`
  4. Try running a simple computation like `client.submit(lambda x: x + 1, 10).result()`
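Run in order, those checks look roughly like this (the scheduler address is taken from the question; adapt as needed):

```python
from dask.distributed import Client

client = Client('10.201.101.108:31001')

# 1. Compare scheduler, worker, and client package versions; check=True raises on mismatch.
print(client.get_versions(check=True))

# 2. An empty dict here means no workers have registered with the scheduler.
print(client.scheduler_info()['workers'])

# 3. The worker logs usually say why a worker disconnected or timed out.
print(client.get_worker_logs())

# 4. A trivial round-trip computation; should print 11.
print(client.submit(lambda x: x + 1, 10).result())
```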
MRocklin
  • `client.get_versions(check=True)` shows all of the nodes and their software versions, but both `client.scheduler_info()['workers']` and `client.get_worker_logs()` hang indefinitely. That makes it seem like it's a network connectivity issue, but if true how would the scheduler have been able to return results from `client.get_versions(check=True)` if it were not able to connect to the worker nodes? – user554481 Jan 04 '18 at 16:41
  • Actually, I take that back: `client.scheduler_info()['workers']` and `client.get_worker_logs()` aren't hanging. They're able to return results about all of the worker nodes quickly and without problems – user554481 Jan 04 '18 at 16:55
  • I've now also tried on my personal Mac (work Mac might have been more locked down) and the issue persists. – user554481 Jan 21 '18 at 19:00