
I am starting Dask with a containerized LocalCluster, but on closing the cluster and client I usually (though intermittently) receive a range of exceptions - see, for example, the one below.

The cleanup code is:

cluster.close()
client.close()

Is there an Exception-free way to close the Dask cluster, followed by closing the client(*)?

No solution I have found has resolved the issue for me. Surely it is possible to exit without an exception?

I would prefer a route that avoids use of the with statement, because the clean-up operation is embedded in a third-party class. If a context manager is the only way to go, is it possible to call the relevant context manager directly, without the with?

  • PS Do I have it right that the cluster should be closed before the client (since the latter would presumably be required to achieve the former)? (Opinion seems to differ on the matter.)

2023-05-22 20:59:26,890 - distributed.client - ERROR -
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/pip_packages/distributed/comm/core.py", line 291, in connect
    comm = await asyncio.wait_for(
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 481, in wait_for
    return fut.result()
  File "/pip_packages/distributed/comm/tcp.py", line 511, in connect
    convert_stream_closed_error(self, e)
  File "/pip_packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <distributed.comm.tcp.TCPConnector object at 0x40deb132b0>: ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/pip_packages/distributed/utils.py", line 742, in wrapper
    return await func(*args, **kwargs)
  File "/pip_packages/distributed/client.py", line 1298, in _reconnect
    await self._ensure_connected(timeout=timeout)
  File "/pip_packages/distributed/client.py", line 1328, in _ensure_connected
    comm = await connect(
  File "/pip_packages/distributed/comm/core.py", line 315, in connect
    await asyncio.sleep(backoff)
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 655, in sleep
    return await future
asyncio.exceptions.CancelledError
2023-05-22 20:59:26 : ERROR : __exit__ : 768 :
Traceback (most recent call last):
  File "/pip_packages/distributed/comm/tcp.py", line 225, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/pip_packages/distributed/client.py", line 1500, in _handle_report
    msgs = await self.scheduler_comm.comm.read()
  File "/pip_packages/distributed/comm/tcp.py", line 241, in read
    convert_stream_closed_error(self, e)
  File "/pip_packages/distributed/comm/tcp.py", line 144, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Client->Scheduler local=tcp://127.0.0.1:49712 remote=tcp://127.0.0.1:43205>: Stream is closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/pip_packages/distributed/utils.py", line 742, in wrapper
    return await func(*args, **kwargs)
  File "/pip_packages/distributed/client.py", line 1508, in _handle_report
    await self._reconnect()
  File "/pip_packages/distributed/utils.py", line 742, in wrapper
    return await func(*args, **kwargs)
  File "/pip_packages/distributed/client.py", line 1298, in _reconnect
    await self._ensure_connected(timeout=timeout)
  File "/pip_packages/distributed/client.py", line 1328, in _ensure_connected
    comm = await connect(
  File "/pip_packages/distributed/comm/core.py", line 315, in connect
    await asyncio.sleep(backoff)
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 655, in sleep
    return await future
asyncio.exceptions.CancelledError

Thanks as ever!


1 Answer


Context managers are good practice (I'd even call them 'best practice', though I'm not sure that view is widely shared).

If a context manager is the sole way to go, is it possible to call the relevant context manager directly without the with?

The relevant context-manager methods are __enter__ and __exit__, but calling them directly is not good practice.
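That said, if the clean-up really has to live inside another class rather than a literal with block, the standard library's contextlib.ExitStack can drive the context-manager protocol for you: enter_context() calls __enter__ and registers __exit__ to be replayed later. A minimal sketch, not from the answer above; the ClusterHolder class and its start/stop methods are hypothetical:

from contextlib import ExitStack

from distributed import Client, LocalCluster


class ClusterHolder:
    """Hypothetical wrapper that owns the cluster/client lifecycle."""

    def start(self):
        self._stack = ExitStack()
        # enter_context() invokes __enter__ and registers __exit__ for later
        self.cluster = self._stack.enter_context(LocalCluster())
        self.client = self._stack.enter_context(Client(self.cluster))

    def stop(self):
        # Unwinds in reverse registration order, so the client
        # is closed before the cluster is torn down.
        self._stack.close()

Because ExitStack unwinds in reverse order, the client-before-cluster order discussed below is preserved automatically.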

Do I have it right that the cluster should be closed before the client (since the latter would presumably be required to achieve the former)?

In general, a given cluster could serve multiple clients, so one typically only opens/closes the client connection, while cluster start-up/shutdown is governed by a separate process. This clarifies the order/hierarchy. Even when the cluster is created for a single, specific purpose (so multiple client connections are not expected), it is still a good idea to follow this order and close the client first. The order is easiest to get right with nested context managers:

from time import sleep

from distributed import Client, LocalCluster

if __name__ == "__main__":
    with LocalCluster() as cluster, Client(cluster) as client:
        futs = client.map(sleep, range(10))
        print(*client.gather(futs))  # will print None 10 times

In the above, a cluster must exist before a client can be instantiated against it.
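If the with statement cannot be used at all, the same order can be reproduced with explicit close() calls in a try/finally. A rough sketch under the same assumptions as the example above (the inc helper is purely illustrative):

from distributed import Client, LocalCluster


def inc(x):
    return x + 1


if __name__ == "__main__":
    cluster = LocalCluster()
    client = Client(cluster)
    try:
        futs = client.map(inc, range(10))
        print(client.gather(futs))
    finally:
        # Mirror the context-manager order: close the client first,
        # then shut the cluster down.
        client.close()
        cluster.close()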

While this doesn't directly address the exceptions you mention, I would either try to produce a reproducible example, or make sure I am running the latest dask version and use the context managers.

The fact that a third-party class performs the clean-up and that this affects client/cluster logic suggests poor separation of concerns, so a better long-term solution is to address that design issue.

  • Thank you very very much for this helpful and authoritative answer, so grateful. I do indeed have multiple client connections and the cluster is a (chargeable) resource on ECS so I also need cluster shutdown. I expect I have a race condition between the clients. Do you have any pointers and/or should I update my Q? Thanks again – jtlz2 May 23 '23 at 12:21
  • I would highly recommend using Coiled for cloud+dask, they have a lot of useful functionality for managing costs/resources/etc. Another vendor is Saturn cloud, but I never tried their services. – SultanOrazbayev May 23 '23 at 13:35
  • Thank you - I'm afraid a buy scenario is not possible for me :( – jtlz2 May 23 '23 at 22:17
  • I should have mentioned this also: https://github.com/dask/dask-cloudprovider – SultanOrazbayev May 24 '23 at 02:22