I have two machines running Ubuntu 18.04 with Docker version 18.09.9 installed. I've set up a swarm cluster in which the manager node advertises its public IP and the worker node advertises its private IP:

# On manager
docker swarm init --advertise-addr INSTANCE_PUBLIC_IP

# On worker
docker swarm join --advertise-addr INSTANCE_PRIVATE_IP --token XXXXXX MANAGER_PUBLIC_IP:2377

The two machines are on the same private network, and the manager can connect to the worker's private IP. The swarm mostly works: I can deploy services, see the ingress network on both nodes, and so on. But when I deploy a service whose container is scheduled on the worker node, I can't reach it through the manager node; the connection times out. From the worker node itself, the connection succeeds.

If, however, I make the worker node advertise its public IP, everything works. The nodes are hosted by DigitalOcean. Do you have any idea where this issue comes from?

Related to "Docker Swarm routing mesh connections time out".

2 Answers

This is typically the result of a firewall blocking the VXLAN overlay networking ports between the hosts. You want the following open:

  • 2377/tcp: swarm manager communication
  • 7946/tcp+udp: overlay networking control port
  • 4789/udp: overlay networking data port
  • protocol 50 (ESP): only needed if you enable encryption on an overlay network

With iptables, you would need the following on each node in the swarm cluster:

iptables -A INPUT -p tcp -m tcp --dport 2377 -j ACCEPT
iptables -A INPUT -p tcp -m tcp --dport 7946 -j ACCEPT
iptables -A INPUT -p udp -m udp --dport 7946 -j ACCEPT
iptables -A INPUT -p udp -m udp --dport 4789 -j ACCEPT
iptables -A INPUT -p 50 -j ACCEPT

Note that some tools, in particular VMware NSX, will block the VXLAN ports.
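
To confirm that the rules above are actually in place on a node, something like the following could help. It is only a sketch: `missing_swarm_rules` is a hypothetical helper name, it assumes rules written in the single-port form shown above (no `multiport`, no port ranges), and it only looks for explicit ACCEPT rules, so a chain whose default policy is already ACCEPT will be reported as "missing" everything.

```shell
#!/bin/sh
# Hypothetical helper: reads `iptables -S INPUT` style output on stdin and
# prints every swarm protocol/port pair that has no explicit ACCEPT rule.
missing_swarm_rules() {
    rules="$(cat)"
    for spec in tcp:2377 tcp:7946 udp:7946 udp:4789; do
        proto="${spec%%:*}"   # part before the colon
        port="${spec##*:}"    # part after the colon
        # Trailing spaces in the pattern keep 2377 from matching 23770.
        if ! printf '%s\n' "$rules" |
            grep -Eq -- "-p $proto .*--dport $port .*-j ACCEPT"; then
            echo "$proto/$port"
        fi
    done
}
```

On each node you would run `sudo iptables -S INPUT | missing_swarm_rules`; empty output means all four port rules were found. A rule that pairs `-p tcp` with port 4789 does not count, since the overlay data port is UDP.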

BMitch

I finally figured it out! The issue wasn't closed ports or a misconfiguration, but the way DigitalOcean's network works internally.

They set up a NAT between the public IP and the private IP of the instances. It turns out that peer-to-peer connections between swarm nodes don't work well through a NAT. As stated in this comment (https://github.com/docker/swarmkit/issues/1429#issuecomment-361924332):

The key is to have direct connectivity (no NAT on the way) from worker node to manager node and vice versa.

This page also mentions potential issues and limitations with NAT: https://en.wikipedia.org/wiki/Network_address_translation#Issues_and_limitations

It says, "Unless the NAT router makes a specific effort to support such protocols, incoming packets cannot reach their destination."

So even if you can get swarm fully working through a NAT, I wouldn't recommend doing so.
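
For what it's worth, here is a rough sketch of how the swarm could be rebuilt so that both nodes advertise addresses on the same private network, which keeps node-to-node traffic off the NAT. The IP values and the `run`, `manager_init` and `worker_join` helpers are placeholders of mine, not something from the question, and I'm assuming the private network carries the swarm ports between the two nodes. Nodes that already belong to a swarm would have to leave it first.

```shell
#!/bin/sh
# Placeholder private addresses (assumptions; substitute your own).
MANAGER_PRIVATE_IP="${MANAGER_PRIVATE_IP:-10.0.0.2}"
WORKER_PRIVATE_IP="${WORKER_PRIVATE_IP:-10.0.0.3}"

# Print the command instead of executing it unless RUN=1 is set,
# so the sketch can be reviewed before it touches anything.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# On the manager: advertise the private address instead of the public one.
manager_init() {
    run docker swarm init --advertise-addr "$MANAGER_PRIVATE_IP"
}

# On the worker: $1 is the token printed by `docker swarm join-token worker`.
worker_join() {
    run docker swarm join --advertise-addr "$WORKER_PRIVATE_IP" \
        --token "$1" "$MANAGER_PRIVATE_IP:2377"
}
```

Calling `manager_init` or `worker_join XXXXXX` only prints the command; prefix the call with `RUN=1` to execute it for real.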