
My goal is to set up a Docker swarm on a group of 3 Linux (Ubuntu) physical workstations and run a Dask cluster on it.

$ docker --version
Docker version 17.06.0-ce, build 02c1d87

I am able to init the docker swarm and add all of the machines to the swarm.

cordoba$ docker node ls
ID                            HOSTNAME    STATUS    AVAILABILITY MANAGER STATUS
j8k3hm87w1vxizfv7f1bu3nfg     box1        Ready     Active              
twg112y4m5tkeyi5s5vtlgrap     box2        Ready     Active              
upkr459m75au0vnq64v5k5euh *   box3        Ready     Active              Leader

I then run `docker stack deploy -c docker-compose.yml dask-cluster` on the Leader box.

Here is docker-compose.yml:

version: "3"

services:

  dscheduler:
    image: richardbrks/dask-cluster
    ports:
     - "8786:8786"
     - "9786:9786"
     - "8787:8787"
    command: dask-scheduler
    networks:
      - distributed
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
      placement:
        constraints: [node.role == manager]

  dworker:
    image: richardbrks/dask-cluster
    command: dask-worker dscheduler:8786
    environment:
      - "affinity:container!=dworker*"
    networks:
      - distributed
    depends_on:
      - dscheduler
    deploy:
      replicas: 3
      restart_policy:
        condition: on-failure

networks:
  distributed:

and here is the Dockerfile for richardbrks/dask-cluster:

# Official python base image
FROM python:2.7    
# update apt-repository
RUN apt-get update
# only install enough library to run dask on a cluster (with monitoring)
RUN pip install --no-cache-dir \
    psutil \
    dask[complete]==0.15.2 \
    bokeh

When I deploy the stack, the dworker containers that are not on the same machine as dscheduler do not know what dscheduler is. I ssh'd into one of those machines and looked in env, and dscheduler was not there. I also tried to ping dscheduler, and got "ping: unknown host".

I thought Docker was supposed to provide internal DNS-based service discovery, so that calling dscheduler would take me to the address of the dscheduler container.

Is there some setup on my computers that I am missing? Or are any of my files missing something?
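For what it's worth, the "unknown host" symptom can be reproduced without ping. Here is a minimal stdlib Python sketch (run inside a worker container; the name `dscheduler` comes from the compose file above):

```python
import socket

def resolves(name):
    """Return True if `name` resolves through the container's DNS
    (Docker's embedded resolver on an overlay network)."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False

# On a healthy overlay network this should print True inside a dworker
# container; the "ping: unknown host" above corresponds to False here.
print(resolves("dscheduler"))
```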

All of this code is also located in https://github.com/MentalMasochist/dask-swarm

Rich
  • Could you please describe how you try to access the other service? Do you do it inside the dworker container? – herm Sep 14 '17 at 14:03
  • @herm Yes. I ssh into the computer where the `dworker` container is being run, I use `docker ps` to get the id of the running container, and then I type `docker exec -ti <container-id> /bin/bash` to enter it. That is where I'm attempting to ping `dscheduler`. – Rich Sep 14 '17 at 14:09
  • You are confusing terms: a node in a swarm is a computer, and with `docker exec` you enter a container, not a node. You used the wrong names but did the right thing :) – herm Sep 14 '17 at 14:45
  • I checked and your setup works fine and I could telnet from worker to scheduler on different machine – Tarun Lalwani Sep 14 '17 at 16:50

2 Answers


According to this issue in swarm:

Because of some networking limitations (I think related to virtual IPs), the ping tool will not work with overlay networking. Are your service names resolvable with other tools like dig?

Personally, I could always connect from one service to the other using curl. Your setup seems correct and your services should be able to communicate.
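The curl check can also be done with the Python standard library. A sketch, assuming the scheduler's Bokeh dashboard is published on 8787 as in the compose file (the URL below is a hypothetical example):

```python
import urllib.error
import urllib.request

def http_reachable(url, timeout=3.0):
    """True if an HTTP server answers at `url` (any status code),
    False if the connection itself fails -- roughly what curl tests."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True  # the server answered, just with an error status
    except (urllib.error.URLError, OSError):
        return False

# e.g. the scheduler's dashboard as published in the compose file:
# http_reachable("http://dscheduler:8787/status")
```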


FYI: `depends_on` is not supported in swarm mode.


Update 2: I think you are not using the port. The service name is no replacement for the port; you need to use the port as the container knows it internally.
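The point about the port can be checked directly: beyond resolving the name, the worker must be able to open the scheduler's TCP port (8786 inside the network). A stdlib sketch of that check, equivalent to `netcat -vz` (service name and port taken from the compose file; the final call is hypothetical usage):

```python
import socket

def port_open(host, port, timeout=3.0):
    """True if a TCP connection to host:port succeeds -- the same
    check as `netcat -vz host port`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Inside a dworker container, this should be True when the overlay
# network and the scheduler are healthy:
# port_open("dscheduler", 8786)
```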

herm
  • I installed and ran dig inside the container, but got an `NXDOMAIN` error, which means it couldn't find the host. Your issue link showed some other possible reasons for not being able to connect to services on other hosts. I will read through the issue and see if any of their suggestions solve my problem. Also, thanks for informing me about `depends_on`. – Rich Sep 14 '17 at 17:53
  • Tarun Lalwani confirmed that your compose file is correct. What is the exact command you use to connect the containers? For curl it would be: curl http://dscheduler:8786/path – herm Sep 14 '17 at 18:53
  • The container dworker should connect to dscheduler via the command `dask-worker dscheduler:8786` from the compose file, where `dscheduler` should resolve to the IP of the scheduler and 8786 is the port. Does this answer your question? – Rich Sep 14 '17 at 23:20
  • Yes. Can you reach the other machine by IP? Maybe the ports are not open or a firewall is intervening. – herm Sep 15 '17 at 12:45
  • I am able to ping the master node machine from the worker node and from the container in the worker node. – Rich Sep 15 '17 at 13:34
  • I have no idea why this won't work for you. I'm sorry – herm Sep 15 '17 at 13:46
  • Also, from the worker node (not in the container though), netcat is able to connect to the following ports on the master node: 2377(tcp), 7946(tcp), 7946(udp) and 4789(udp). – Rich Sep 15 '17 at 13:52
  • what about 8786? Since you publish this port (8786:8786) it should also be reachable – herm Sep 15 '17 at 13:58
  • I think I found something. When I am inside the container on the worker node, I get the following error: `root@adc78cf2c38d:/# netcat -vz cordoba.<company>.com 8786` → `DNS fwd/rev mismatch: cordoba.<company>.com != <host>-static.hfc.comcastbusiness.net cordoba.<company>.com [<ip>] 8786 (?) open`. Values with <> around them are me anonymizing the output. – Rich Sep 15 '17 at 14:03
  • Your network seems to be your problem and not docker swarm – herm Sep 15 '17 at 14:27
  • Found the problem: the firmware in our office's router had a bug. Once I went back to a prior firmware version, everything worked fine. @herm: thanks for the help! – Rich Sep 25 '17 at 19:30

There was nothing wrong with dask or docker swarm. The problem was bad router firmware. After I went back to a prior version of the router firmware, the cluster worked fine.

Rich