Frequently, I encounter an issue where Dask randomly stalls on a couple of tasks, usually tied to a read of data from a different node on my network (more details on this below). This can happen after several hours of running the script without issues. It hangs indefinitely in the form shown below (this loop otherwise takes a few seconds to complete):
In this case, I see that there are just a handful of stalled processes, all on one particular node (192.168.0.228):
Each worker on this node is stalled on a couple of read_parquet tasks:
This was called with the following code, using the fastparquet engine:
ddf = dd.read_parquet(file_path, columns=['col1', 'col2'], index=False, gather_statistics=False)
My cluster runs Ubuntu 19.04 with the latest versions (as of 11/12) of Dask, Distributed, and the required packages (tornado, fsspec, fastparquet, etc.).
The data that the .228 node is trying to access is located on another node in my cluster. The .228 node accesses the data through CIFS file sharing. I run the Dask scheduler on the same node on which I'm running the script (different from both the .228 node and the data storage node). The script connects the workers to the scheduler via SSH using Paramiko:
ssh_client = paramiko.SSHClient()
# (connection setup omitted)
stdin, stdout, stderr = ssh_client.exec_command(
    'sudo dask-worker' +
    ' --name ' + comp_name_decode +
    ' --nprocs ' + str(nproc_int) +
    ' --nthreads 10 ' +
    self.dask_scheduler_ip,
    get_pty=True)
The .228 node's connections to the scheduler and to the data-storage node all look healthy. It is possible that the .228 node experienced some sort of brief connectivity issue while processing the read_parquet task, but if so, its connections to the scheduler and to the CIFS shares were not impacted beyond that brief moment. In any case, the logs show no issues. This is the entire log from the .228 node:
distributed.worker - INFO - Start worker at: tcp://192.168.0.228:42445
distributed.worker - INFO - Listening to: tcp://192.168.0.228:42445
distributed.worker - INFO - dashboard at: 192.168.0.228:37751
distributed.worker - INFO - Waiting to connect to: tcp://192.168.0.167:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 2
distributed.worker - INFO - Memory: 14.53 GB
distributed.worker - INFO - Local Directory: /home/dan/worker-50_838ig
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://192.168.0.167:8786
distributed.worker - INFO - -------------------------------------------------
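For what it's worth, in case a brief network drop really is the trigger, one knob I could try is Distributed's comm timeouts via Dask's configuration system. A sketch (the key names come from distributed's default config; the values are placeholders to experiment with, not recommendations):

```python
import dask

# Sketch: adjust comm timeouts (key names from distributed's default config;
# the values below are placeholders to experiment with, not recommendations)
dask.config.set({
    "distributed.comm.timeouts.connect": "30s",  # time allowed to establish a connection
    "distributed.comm.timeouts.tcp": "60s",      # idle time before a TCP comm is considered dead
})
```

These would need to be set where each worker starts (e.g., in ~/.config/dask/distributed.yaml or via environment variables) to affect worker-side comms; setting them in the client process alone is not enough.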
Putting aside whether this is a bug in Dask or in my code/network, is it possible to set a general timeout for all tasks handled by the scheduler? Alternatively, is it possible to:
- identify stalled tasks,
- copy a stalled task and move it to another worker, and
- cancel the stalled task?
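For context, the closest I have been able to piece together from the client API is Client.processing, Client.call_stack, and Client.cancel. A minimal identify-and-cancel sketch (the slow task and path are made-up stand-ins for a stalled read_parquet; the in-process cluster is just for illustration):

```python
import time
from dask.distributed import Client

client = Client(processes=False, dashboard_address=None)  # in-process cluster, illustration only

def slow_read(path):
    time.sleep(5)  # stand-in for a read_parquet task that has stalled
    return path

fut = client.submit(slow_read, '/mnt/cifs/part.0.parquet')  # hypothetical path
time.sleep(1)  # give the task time to start running

# 1. Identify what each worker is currently executing
print(client.processing())  # {worker_address: (task_keys, ...)}
print(client.call_stack())  # per-worker Python call stacks of running tasks

# 2. Cancel the task. Note: cancelling releases the key on the scheduler,
# but it cannot interrupt code already running inside a worker thread.
client.cancel(fut)
for _ in range(50):  # cancellation is confirmed asynchronously
    if fut.cancelled():
        break
    time.sleep(0.1)

client.close()
```

Resubmitting after the cancel would presumably be the "move to another worker" step; I have not found a built-in per-task timeout that does all of this automatically.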