I've been trying to set up a dask.distributed cluster using Kubernetes. Setting up the kube cluster itself is pretty straightforward; the problem I'm currently struggling with is that I can't get the local scheduler to connect to the workers. Workers can connect to the scheduler, but they advertise an address inside the kube network that is not accessible to the scheduler running outside the kube network.
Following the examples from the dask-kubernetes docs, I got a kube cluster running on AWS and (on a separate AWS machine) started a notebook with the local dask.distributed scheduler. The scheduler launches a number of workers on the kube cluster, but it cannot connect to said workers because the workers are on a different network: the internal kube network.
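Concretely, the launch code follows the pattern from the dask-kubernetes docs; here is a minimal sketch of what I'm running (`worker-spec.yml` stands in for my actual worker pod spec):

```python
from dask.distributed import Client
from dask_kubernetes import KubeCluster

# The scheduler runs locally in this process (on 192.168.0.0/24);
# the workers are launched as pods inside the kube cluster.
cluster = KubeCluster.from_yaml('worker-spec.yml')
cluster.scale(3)  # request 3 worker pods

client = Client(cluster)
```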
The network setup looks like the following:

- notebook server running on 192.168.0.0/24
- kube cluster EC2 instances also on 192.168.0.0/24
- kube pods on 100.64.0.0/16

The dask scheduler runs on 192.168.0.0/24 but the dask workers are on 100.64.0.0/16, so how do I connect the two? Should I run the scheduler in a kube pod as well, edit routing tables, or try to figure out the host machines' IP addresses on the workers?
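To make that last option concrete, here is roughly what I have in mind (an untested sketch, assuming the classic `KubeCluster.from_dict` API; I don't know whether `hostNetwork` is the sanctioned approach, which is exactly my question):

```python
from dask_kubernetes import KubeCluster

# Untested assumption: with hostNetwork the worker binds to the EC2
# node's 192.168.0.0/24 address instead of a pod IP, so the address
# it advertises would be reachable by the local scheduler.
pod_spec = {
    "kind": "Pod",
    "spec": {
        "hostNetwork": True,
        "containers": [
            {
                "name": "dask-worker",
                "image": "daskdev/dask:latest",
                "args": ["dask-worker", "--nthreads", "1", "--death-timeout", "60"],
            }
        ],
    },
}
cluster = KubeCluster.from_dict(pod_spec)
```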
The workers are able to connect to the scheduler, but on the scheduler side I get errors of the form:
```
distributed.scheduler - ERROR - Failed to connect to worker 'tcp://100.96.2.4:40992': Timed out trying to connect to 'tcp://100.96.2.4:40992' after 3.0 s: connect() didn't finish in time
```
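It looks like a pure routing problem rather than anything dask-specific; the same failure is reproducible from the scheduler machine with a plain socket (address and port copied from the error above):

```python
import socket

# Worker address copied from the scheduler error log above.
addr = ("100.96.2.4", 40992)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(3.0)  # same 3 s timeout the scheduler uses
try:
    s.connect(addr)
    print("connected")
except OSError as exc:  # socket.timeout is a subclass of OSError
    # The pod network is not routable from 192.168.0.0/24, so this
    # fails the same way the scheduler's connection attempt does.
    print(f"failed: {exc}")
finally:
    s.close()
```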
I'm not looking for a list of possible things I could do; I'm looking for the recommended way of setting this up, specifically in relation to dask.distributed.
I set up the kube cluster using kops.