Good afternoon SO, I am trying to deploy a WRF post-processing solution in Python using Dask and wrf-python on a cluster; however, I am encountering an issue with the interaction between the dask scheduler and the worker instances.
In my setup, the scheduler is started by the primary script (which runs on the login node of the cluster) using the following code block:
import socket
from threading import Thread
from tornado.ioloop import IOLoop
from distributed import Scheduler, Client

# Run the tornado event loop in a background thread, then start the scheduler on the login node
cLoop = IOLoop.current()
t = Thread(target = cLoop.start, daemon = True)
t.start()
s = Scheduler(loop = cLoop, dashboard_address = None)
s.start("tcp://:" + str(scheduler_port))
dask_client = Client("tcp://" + socket.gethostname() + ":" + str(scheduler_port))
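The primary script then waits for dask workers to come up on the compute nodes and register with the scheduler. The wait is roughly the following (a simplified sketch rather than the exact code from the repository; expected_workers comes from the program's configuration):
import time

# Block until the expected number of workers has registered with the scheduler
def wait_for_workers(client, expected_workers, poll_interval = 5):
    while len(client.scheduler_info()["workers"]) < expected_workers:
        time.sleep(poll_interval)

wait_for_workers(dask_client, expected_workers = 8)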
The dask workers themselves are run on the compute nodes of the system and are initialized through two shell scripts (one for the job, another to start the worker):
#!/bin/bash
#COBALT -t 60
#COBALT -n 8
#COBALT -A climate_severe
#COBALT -q debug-cache-quad
#COBALT --attrs mcdram=cache:numa=quad
aprun -n ${COBALT_JOBSIZE} -N 1 -d 64 -j 1 ./launch-worker.sh
The second script is programmatically generated based on which login node the original script is running on and on configuration settings in the program:
#!/bin/bash
export PYTHONPATH=${PYTHONPATH}:/projects/climate_severe/wrf-run/post/Python/
/projects/climate_severe/Python/anaconda/bin/python3.7 -m distributed.cli.dask_worker \
thetalogin4:12345 --nprocs 1 \
--death-timeout 120 --no-dashboard
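The generation step itself is essentially just the primary script writing this file out with the login node's hostname and the configured port substituted in. A minimal sketch of that, assuming the socket and scheduler_port variables from the first code block (the real code in the repository differs):
# Sketch: write launch-worker.sh with the scheduler address filled in
worker_script = (
    "#!/bin/bash\n"
    "export PYTHONPATH=${PYTHONPATH}:/projects/climate_severe/wrf-run/post/Python/\n"
    "/projects/climate_severe/Python/anaconda/bin/python3.7 -m distributed.cli.dask_worker \\\n"
    "    " + socket.gethostname() + ":" + str(scheduler_port) + " --nprocs 1 \\\n"
    "    --death-timeout 120 --no-dashboard\n"
)
with open("launch-worker.sh", "w") as f:
    f.write(worker_script)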
This setup functions and the workers connect to the scheduler; however, they terminate about a minute or two after the initial connection is made. The scheduler does not push any errors to the Python terminal (debug prints are turned on via logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.DEBUG)). The worker information is pushed to the job error file that is generated during the run:
distributed.worker - INFO - Start worker at: tcp://10.236.16.130:16839
distributed.worker - INFO - Listening to: tcp://10.236.16.130:16839
distributed.worker - INFO - Waiting to connect to: tcp://thetalogin4:12345
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 256
distributed.worker - INFO - Memory: 202.69 GB
distributed.worker - INFO - Local Directory: /lus/theta-fs0/projects/climate_severe/runs/20180601/postprd/worker-4u3lent8
.
.
.
distributed.core - INFO - Event loop was unresponsive in Worker for 25.80s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.worker - ERROR - Worker stream died during communication: tcp://10.236.16.120:15761
Traceback (most recent call last):
File "/projects/climate_severe/Python/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 218, in connect
quiet_exceptions=EnvironmentError,
File "/projects/climate_severe/Python/anaconda/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
tornado.util.TimeoutError: Timeout
After this error, many more are pushed in quick succession regarding StreamClosedError and missing dependencies:
OSError: Timed out trying to connect to 'tcp://10.236.16.120:15761' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x2aaaed9616a0>: ConnectionRefusedError: [Errno 111] Connection refused
distributed.worker - INFO - Can't find dependencies for key ('wrapped_add_then_div-fe3451c36b590fa821c9101013c573b4', 0, 0, 0)
distributed.worker - INFO - Dependent not found: ('getitem-6f7afbff56ac240317ffbfde59bfcb8a', 0, 0, 0) 0 . Asking scheduler
distributed.worker - ERROR - Worker stream died during communication: tcp://10.236.16.130:28228
The requested function is located in a Python script in the directory on PYTHONPATH (set in the launch-worker.sh script).
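To illustrate the kind of call being made (the function body, module placement, and array sizes below are placeholders, not the actual post-processing code), the calculations are built up as dask array operations and computed through the client:
import dask.array as da

# Placeholder standing in for the real wrf-python based calculation that lives
# in the script on PYTHONPATH; the workers need to be able to import that module
def wrapped_add_then_div(a, b, c):
    return (a + b) / c

# Dummy arrays standing in for the WRF fields being post-processed
a = da.ones((1000, 1000), chunks = (500, 500))
b = da.ones((1000, 1000), chunks = (500, 500))
c = da.full((1000, 1000), 2.0, chunks = (500, 500))

result = da.map_blocks(wrapped_add_then_div, a, b, c)
result.compute()  # runs on the workers via the scheduler started above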
What baffles me here is that I once had a minimal working version of this package (one node, 8 processes) handling a few variables, but the moment I started increasing the number of variables it began failing this way. I have tried changing the setup to 8 nodes with 1 process on each to increase the memory allowance per worker from 16 GB to the full ~200 GB, and I have even reduced the script to post-process only a single variable, but nothing so far has completed successfully (I get this same error each time).
Any assistance in identifying the source of the problem would be greatly appreciated; the full code is available on GitHub if additional context is needed.
Thanks!