
I'm on dask 1.1.1 (the latest version) and I started a dask scheduler at the command line with this command:

$ dask-scheduler --port 9796 --bokeh-port 9797 --bokeh-prefix my_project
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:     tcp://10.1.0.107:9796
distributed.scheduler - INFO -       bokeh at:                     :9797
distributed.scheduler - INFO - Local Directory:    /tmp/scheduler-pdnwslep
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Register tcp://10.1.25.4:36310
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.1.25.4:36310
distributed.core - INFO - Starting established connection

Then I tried to start up a client to connect to the scheduler using this code:

from dask.distributed import Client
c = Client('10.1.0.107:9796', set_as_default=False)

but upon trying to do that, I get an error:

...
 File "/root/anaconda3/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
  raise_exc_info(self._exc_info)
 File "<string>", line 4, in raise_exc_info
 tornado.gen.TimeoutError: Timeout
During handling of the above exception, another exception occurred:
...
 File "/root/anaconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 195, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.1.0.107:9796' after 10 s: connect() didn't finish in time

This has been hardcoded in a system that's been running for months now, so I'm writing this question to verify that I'm not doing anything wrong programmatically. I think it must be something wrong with the environment. Does everything look right to you? What kinds of things outside of dask and python could be stopping this? Certificates? Differing versions of packages? Thoughts?
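
For what it's worth, a raw socket check, assuming the same host and port as above, can rule dask out entirely: if this times out too, the problem is the network (firewall, routing, proxy), not dask.

import socket

# Plain TCP reachability check, no dask involved; host and port as above.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(10)
try:
    sock.connect(('10.1.0.107', 9796))
    print('TCP connect succeeded: the scheduler port is reachable')
except OSError as err:  # socket.timeout is a subclass of OSError in 3.x
    print(f'TCP connect failed ({err}): look at the network, not dask')
finally:
    sock.close()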

MetaStack
  • Did you ever resolve this? I'm facing the same issue; this error emerged without any significant changes. – Zev Averbach May 28 '21 at 12:09
  • @ZevAverbach Oh man, that was so long ago. I'm certain I found a solution, but I don't remember what it might have been. This may have been around the time we rewrote the whole system: we dockerized the workers and scheduler at one point. Maybe it was an SSL issue - we ran into that a lot when the environment wasn't set up just right on that corporate network. You know what, I'll add an answer with a little wrapper we made for dask: it probably won't help much, but maybe. – MetaStack May 28 '21 at 15:57
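
If it was the SSL issue mentioned in that last comment, dask can be pointed at TLS credentials explicitly via distributed.security.Security; a minimal sketch, with placeholder certificate paths (a plain tcp:// connection to a TLS-only scheduler times out in exactly this way):

# Hedged sketch: connecting over TLS instead of plain TCP.
# The .pem paths below are placeholders, not from the original question.
from dask.distributed import Client
from distributed.security import Security

sec = Security(
    tls_ca_file='ca.pem',
    tls_client_cert='client-cert.pem',
    tls_client_key='client-key.pem',
    require_encryption=True)
c = Client('tls://10.1.0.107:9796', security=sec, set_as_default=False)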

1 Answer


(See comments in question)

A wrapper for dask, mainly to bake in our particular configuration and make it easy to use in our system of docker containers:

''' daskwrapper: easy access to distributed computing '''
import webbrowser
from dask.distributed import Client as DaskClient
from . import config

scheduler_config = { # from yaml
    "scheduler_hostname": "schedulermachine.corpdomain.com",
    "scheduler_ip": "10.0.0.1"}
worker_config = { # from yaml
    "environments": {
        "generic": {
            "scheduler_port": 9796,
            "dashboard_port": 9797,
            "worker_port": 67176}}}

class Client():

    def __init__(self, environment: str):
        (
            self.scheduler_hostname,
            self.scheduler_port,
            self.dashboard_port,
            self.scheduler_address) = self.get_scheduler_details(environment)
        self.client = DaskClient(self.scheduler_address, asynchronous=False)

    def get_scheduler_details(self, environment: str) -> tuple:
        ''' gets it from a map of available docker images... '''
        envs = worker_config['environments']
        return (
            scheduler_config['scheduler_hostname'],
            envs[environment]['scheduler_port'],
            envs[environment]['dashboard_port'],
            (
                f"{scheduler_config['scheduler_hostname']}:"
                f"{str(envs[environment]['scheduler_port'])}"))

    def open_status(self):
        webbrowser.open_new_tab(self.get_status())

    def get_status(self):
        return f'http://{self.scheduler_hostname}:{self.dashboard_port}/status'

    def get_async_client(self):
        ''' returns a client instance so the user can use it directly '''
        return DaskClient(self.scheduler_address, asynchronous=True)

    def get(self, workflow: dict, tasks: 'str|list'):
        return self.client.get(workflow, tasks)

    async def submit(self, function: callable, args: list):
        ''' saved as example dask api '''
        if not isinstance(args, (list, tuple)):
            args = [args]
        async with DaskClient(self.scheduler_address, asynchronous=True) as client:
            future = client.submit(function, *args)
            result = await future
        return result

    def close(self):
        return self.client.close()

That was the Client and it was used this way:

from daskwrapper import Client
dag = {'some_task': (some_task_function, )}
workers = Client(environment='some_environment')
workers.get(workflow=dag, tasks='some_task')
workers.close()
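
The async path was available through the same wrapper; a sketch using the same placeholder task function, matching the event-loop style of the scheduler code below:

import asyncio
from daskwrapper import Client

workers = Client(environment='some_environment')
result = asyncio.get_event_loop().run_until_complete(
    workers.submit(some_task_function, [1, 2, 3]))
workers.close()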

The scheduler was started like this:

import asyncio
from threading import Thread
from distributed import Scheduler

def start():
    def start_scheduler(port, dashboard_port):
        async def f():
            s = Scheduler(
                port=port,
                dashboard_address=f"0.0.0.0:{dashboard_port}")
            s = await s
            await s.finished()

        asyncio.get_event_loop().run_until_complete(f())

    worker_config = configs.get(repo='spartan_worker')  # internal config loader, not shown
    envs = worker_config['environments']
    for key, value in envs.items():
        port = value['scheduler_port']
        dashboard_port = str(value['dashboard_port'])
        thread = Thread(
            target=start_scheduler,
            args=(port, dashboard_port))
        thread.start()

and the workers:

import asyncio
from distributed import Worker

def start(
    scheduler_address: str,
    scheduler_port: int,
    worker_address: str,
    worker_port: int
):
    async def f(scheduler_address):
        w = await Worker(
            scheduler_address,
            port=worker_port,
            contact_address=f'{worker_address}:{worker_port}')
        await w.finished()

    asyncio.get_event_loop().run_until_complete(f(
        f'tcp://{scheduler_address}:{str(scheduler_port)}'))
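
A hypothetical invocation (the hostnames and worker port here are placeholders, not our real values):

start(
    scheduler_address='schedulermachine.corpdomain.com',
    scheduler_port=9796,
    worker_address='workermachine.corpdomain.com',
    worker_port=57176)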

This probably won't help you directly with this issue, but I do believe that since we dockerized it we haven't had that problem anymore. There's a lot missing here, but these are the basics. There are probably much better ways to get specialized environments for distributed computing, but this fits our needs.
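
For a sense of the shape of it, a minimal sketch of that kind of dockerized setup, assuming the public daskdev/dask image and host networking (our real images and config differed):

$ docker run -d --network host daskdev/dask dask-scheduler --port 9796
$ docker run -d --network host daskdev/dask dask-worker tcp://10.1.0.107:9796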

MetaStack