0

I'm attempting to run the Dask-MPI "Getting Started" (http://mpi.dask.org/en/latest/) example in a fresh Anaconda environment.

I set up an environment using

conda create -n dask-mpi -c conda-forge python=3.7 dask-mpi
conda activate dask-mpi

Inside the environment, I run

mpirun -np 4 dask-mpi --scheduler-file ./scheduler.json

Then, from a python interpreter on the same machine (and in the same folder), I run

from dask.distributed import Client
client = Client(scheduler_file='/path/to/scheduler.json')

This results in the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/client.py", line 712, in __init__
    self.start(timeout=timeout)
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/client.py", line 858, in start
    sync(self.loop, self._start, **kwargs)
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/utils.py", line 331, in sync
    six.reraise(*error[0])
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/utils.py", line 316, in f
    result[0] = yield future
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/client.py", line 954, in _start
    yield self._ensure_connected(timeout=timeout)
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/client.py", line 1015, in _ensure_connected
    timedelta(seconds=timeout), self._update_scheduler_info()
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
tornado.util.TimeoutError: Timeout

The terminal that I ran dask-mpi from does not have any output which would indicate that something is trying to connect. I have verified that the port in question, 8786, is open. I've also verified via debugger that the client is getting the correct address from the scheduler file.

I've tried this in quite a few different environments and on a few different machines, including a fresh Ubuntu 18.04 docker container. I'm completely at a loss for what steps I might be missing.

nleaf
  • 58
  • 3

1 Answers1

0

It turns out this was due to an error in newer versions of dask.distributed (1.25.3) which broke the behavior of dask-mpi. This seems to be fixed as of dask-mpi 1.0.3 (https://github.com/dask/dask-mpi/releases/tag/1.0.3).

nleaf
  • 58
  • 3