I can get dask-worker to connect to dask-scheduler; the problem occurs after I submit tasks. The task stream suggests that the workers do perform the computation. The error log from the dask worker is very long and I don't understand it: it reports both a timeout and a refused connection. Which connection is being refused? As far as I know, there are no firewalls between the two machines (both are on the same LAN).
Note that the same or similar-looking errors occur over and over. Eventually the computation fails with "ValueError: Could not find dependent array-original-0effb3cc096e32a82e95557c88b795fd. Check worker logs". The worker log is below.
distributed.nanny - INFO - Start Nanny at: 'tcp://10.0.0.42:36199'
distributed.worker - INFO - Start worker at: tcp://10.0.0.42:44304
distributed.worker - INFO - bokeh at: 10.0.0.42:8789
distributed.worker - INFO - http at: 10.0.0.42:40349
distributed.worker - INFO - nanny at: 10.0.0.42:36199
distributed.worker - INFO - Waiting to connect to: tcp://10.0.0.50:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 40
distributed.worker - INFO - Memory: 121.64 GB
distributed.worker - INFO - Local Directory: worker-qdz2_s09
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://10.0.0.50:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:34876
Traceback (most recent call last):
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/distributed/comm/core.py", line 185, in connect
quiet_exceptions=EnvironmentError)
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/tornado/gen.py", line 1015, in run
value = future.result()
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/tornado/concurrent.py", line 237, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
tornado.gen.TimeoutError: Timeout
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/distributed/worker.py", line 1617, in gather_dep
who=self.address)
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/tornado/gen.py", line 1015, in run
value = future.result()
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/tornado/concurrent.py", line 237, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/tornado/gen.py", line 1021, in run
yielded = self.gen.throw(*exc_info)
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/distributed/core.py", line 479, in send_recv_from_rpc
comm = yield self.pool.connect(self.addr)
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/tornado/gen.py", line 1015, in run
value = future.result()
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/tornado/concurrent.py", line 237, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/tornado/gen.py", line 1021, in run
yielded = self.gen.throw(*exc_info)
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/distributed/core.py", line 583, in connect
connection_args=self.connection_args)
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/tornado/gen.py", line 1015, in run
value = future.result()
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/tornado/concurrent.py", line 237, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/tornado/gen.py", line 1021, in run
yielded = self.gen.throw(*exc_info)
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/distributed/comm/core.py", line 194, in connect
_raise(error)
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/distributed/comm/core.py", line 177, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://127.0.0.1:34876' after 3.0 s: in <distributed.comm.tcp.TCPConnector object at 0x7fcbfc5e6f98>: ConnectionRefusedError: [Errno 111] Connection refused
distributed.worker - INFO - Can't find dependencies for key ('array-concatenate-39749c96029f622599cd35ec80ca507c', 297, 0, 0)
distributed.worker - INFO - Dependent not found: array-original-7a8cba4415f43af718833379b651ccb6 0 . Asking scheduler
distributed.worker - INFO - Dependent not found: array-original-0effb3cc096e32a82e95557c88b795fd 0 . Asking scheduler
distributed.worker - INFO - Can't find dependencies for key ('array-concatenate-39749c96029f622599cd35ec80ca507c', 263, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('array-concatenate-39749c96029f622599cd35ec80ca507c', 292, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('array-concatenate-39749c96029f622599cd35ec80ca507c', 256, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('array-concatenate-39749c96029f622599cd35ec80ca507c', 278, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('array-concatenate-39749c96029f622599cd35ec80ca507c', 284, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('array-concatenate-39749c96029f622599cd35ec80ca507c', 275, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('array-concatenate-39749c96029f622599cd35ec80ca507c', 285, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('array-concatenate-39749c96029f622599cd35ec80ca507c', 301, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('array-concatenate-39749c96029f622599cd35ec80ca507c', 295, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('array-concatenate-39749c96029f622599cd35ec80ca507c', 303, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('array-concatenate-39749c96029f622599cd35ec80ca507c', 271, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('array-concatenate-39749c96029f622599cd35ec80ca507c', 281, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('array-concatenate-39749c96029f622599cd35ec80ca507c', 287, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('array-concatenate-39749c96029f622599cd35ec80ca507c', 305, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('array-concatenate-39749c96029f622599cd35ec80ca507c', 282, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('array-concatenate-39749c96029f622599cd35ec80ca507c', 173, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('array-concatenate-39749c96029f622599cd35ec80ca507c', 178, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('array-concatenate-39749c96029f622599cd35ec80ca507c', 190, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('array-concatenate-39749c96029f622599cd35ec80ca507c', 185, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('array-concatenate-39749c96029f622599cd35ec80ca507c', 195, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('array-concatenate-39749c96029f622599cd35ec80ca507c', 194, 0, 0)
distributed.worker - INFO - Can't find dependencies for key ('array-concatenate-39749c96029f622599cd35ec80ca507c', 177, 0, 0)
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:34876
Traceback (most recent call last):
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/distributed/comm/core.py", line 185, in connect
quiet_exceptions=EnvironmentError)
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/tornado/gen.py", line 1015, in run
value = future.result()
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/tornado/concurrent.py", line 237, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
tornado.gen.TimeoutError: Timeout
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/distributed/worker.py", line 1617, in gather_dep
who=self.address)
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/tornado/gen.py", line 1015, in run
value = future.result()
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/tornado/concurrent.py", line 237, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/tornado/gen.py", line 1021, in run
yielded = self.gen.throw(*exc_info)
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/distributed/core.py", line 479, in send_recv_from_rpc
comm = yield self.pool.connect(self.addr)
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/tornado/gen.py", line 1015, in run
value = future.result()
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/tornado/concurrent.py", line 237, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/tornado/gen.py", line 1021, in run
yielded = self.gen.throw(*exc_info)
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/distributed/core.py", line 583, in connect
connection_args=self.connection_args)
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/tornado/gen.py", line 1015, in run
value = future.result()
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/tornado/concurrent.py", line 237, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/tornado/gen.py", line 1021, in run
yielded = self.gen.throw(*exc_info)
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/distributed/comm/core.py", line 194, in connect
_raise(error)
File "/home/paul/anaconda3/envs/ecopy/lib/python3.5/site-packages/distributed/comm/core.py", line 177, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://127.0.0.1:34876' after 3.0 s: in <distributed.comm.tcp.TCPConnector object at 0x7fcbfc50b4a8>: ConnectionRefusedError: [Errno 111] Connection refused
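
For what it's worth, the address in the error is tcp://127.0.0.1:34876, which is neither the scheduler (10.0.0.50:8786) nor the worker itself (10.0.0.42:44304). To narrow down which endpoint is refusing, something like the following could be run on the worker machine. This is only a rough sketch using the Python standard library; `parse_address` and `probe` are hypothetical helper names of my own, not part of distributed, and the 3-second timeout simply mirrors the "after 3.0 s" in the log.

```python
import re
import socket

# Matches the IPv4 host and port in log text such as 'tcp://127.0.0.1:34876'.
ADDR_RE = re.compile(r"tcp://([\d.]+):(\d+)")


def parse_address(line):
    """Return (host, port) from a log line, or None if it has no tcp:// address."""
    match = ADDR_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def probe(host, port, timeout=3.0):
    """Attempt a plain TCP connection and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError:
        # Timeout, no route to host, and similar failures.
        return "unreachable"


if __name__ == "__main__":
    line = "OSError: Timed out trying to connect to 'tcp://127.0.0.1:34876' after 3.0 s"
    host, port = parse_address(line)
    print(host, port, probe(host, port))
```

Probing the scheduler address (10.0.0.50, port 8786) the same way from the worker machine would show whether only the loopback address is affected.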