
I am looking for help debugging the following setup:

I have 3 VPS cloud instances from a hosting company. (I believe the VPSs run on VMware, but I can't find any documentation on the hosting company's site.)

  • All are running Ubuntu 18.04.
  • I have installed docker on all 3.

All the docker versions are the same:

Client: Docker Engine - Community
 Version:           19.03.5
 API version:       1.40
 Go version:        go1.12.12
 Git commit:        633a0ea838
 Built:             Wed Nov 13 07:29:52 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.5
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.12
  Git commit:       633a0ea838
  Built:            Wed Nov 13 07:28:22 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.4
  GitCommit:        e6b3f5632f50dbc4e9cb6288d911bf4f5e95b18e
 runc:
  Version:          1.0.0-rc6+dev
  GitCommit:        6635b4f0c6af3810594d2770f662f34ddc15b40d
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

On node 1 I ran the following init command:

docker swarm init --advertise-addr NODE_1_IP --data-path-port=7789

And on nodes 2 and 3 I ran the following join command:

docker swarm join --token XXX --advertise-addr NODE_2/3_IP NODE_1_IP:2377

The token is the value node 1 gave me. I resolved a previous problem by specifying --data-path-port. I think this is because the VPSs run on VMware, which conflicts with the standard data port (4789).

My cloud provider gives me a UI to apply firewall rules to individual VPSs. I have used a firewall group to apply the following rules to all 3 servers:

TCP ACCEPT to dest ports 80, 443, (and my SSH port)
ICMP ACCEPT any
TCP ACCEPT 2376
TCP, UDP ACCEPT 7789
UDP ACCEPT 7789
TCP ACCEPT 2377
ESP ACCEPT
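
For comparison with the rule set above, here is a minimal sketch that prints the inter-node ports Docker's swarm documentation lists as required, with the overlay data port as a parameter (7789 here instead of the default 4789). The function name is hypothetical and the list is written from the documentation as I understand it, so it is worth checking against the current docs:

```shell
#!/bin/sh
# Hypothetical helper: print the inter-node rules a swarm is documented to need.
# $1 = overlay data path port (Docker's default is 4789; this cluster uses 7789).
swarm_required_rules() {
  data_port=${1:-4789}
  echo "tcp 2377"          # cluster management
  echo "tcp 7946"          # node-to-node communication
  echo "udp 7946"
  echo "udp $data_port"    # VXLAN overlay data
  echo "esp"               # IP protocol 50, only for --opt encrypted networks
}

swarm_required_rules 7789
```

Each printed line can then be ticked off against the firewall group by hand.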

To test this I ran the following commands on node 1

docker network create --driver=overlay --attachable testnet
docker network create --opt encrypted --driver=overlay --attachable testnet_encrypted

docker service create --network=testnet --name web --publish 80 --replicas=5 nginx:latest

Once the service is running across the cluster I do the following:

docker run --rm --name alpine --net=testnet -ti alpine:latest sh
apk add --no-cache curl

I then run curl 5 times:

curl web

All 5 times I get a response. If I keep going I keep getting responses. I think this means all containers are getting traffic.

Then I switch the service over to the encrypted network and repeat the same test:

docker service rm web
docker service create --network=testnet_encrypted --name web --publish 80 --replicas=5 nginx:latest

docker run --rm --name alpine --net=testnet_encrypted -ti alpine:latest sh
apk add --no-cache curl

Again I run curl 5 times:

curl web

Sometimes it works, and other times it just hangs until I press Ctrl-C.

If I run it in multiples of 5, the same pattern of working and hanging requests repeats. I think this is because the containers running on NODE_1 respond, but communication with the containers on nodes 2 and 3 is not working.
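
The repeated test can be made less manual with a small loop. This is a minimal sketch in POSIX sh (so it should also run in the alpine container); the function name `probe` and the 3 second timeout are my own choices, not anything Docker provides:

```shell
#!/bin/sh
# Hypothetical helper: run a command N times and count successes and failures.
# $1 = number of attempts, remaining arguments = the command to run.
probe() {
  n=$1; shift
  ok=0; fail=0; i=0
  while [ "$i" -lt "$n" ]; do
    if "$@" >/dev/null 2>&1; then
      ok=$((ok + 1))
    else
      fail=$((fail + 1))
    fi
    i=$((i + 1))
  done
  echo "ok=$ok fail=$fail"
}

# Usage inside the test container (-m 3 makes curl give up after 3 seconds
# instead of hanging):
#   probe 10 curl -s -m 3 web
```

With 5 replicas and round-robin load balancing, the failure count over 10 attempts should hint at how many replicas are unreachable.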

The ESP ACCEPT rule was added to my cloud provider firewall ruleset after some research into the issue.

I have tried rebooting the cluster but no luck.

Now I am stuck. Are there any recommendations on how I can proceed with debugging? Thanks, Robert

Update 1

To debug, I changed the test so that the web service ran as a single instance on NODE_3. I then opened two consoles on node 3 and ran the following commands, one in each:

sudo tcpdump src NODE_1_IP and dst NODE_3_IP and port 7789
sudo tcpdump src NODE_3_IP and dst NODE_1_IP and port 7789

One console shows me traffic into NODE_3 and the other shows traffic out of NODE_3.

I then ran the unencrypted test again. I see about 7 lines appear on the incoming console and 5 lines appear on the outgoing console. So there is traffic going into NODE_3 and traffic going out of NODE_3, and the test is working.

I then ran the encrypted test. This time I see a single line appear on the incoming console, and nothing on the outgoing console. So a single packet is getting to NODE_3. I am not sure if it is getting decrypted and passed on to the container.
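
Since encrypted overlay traffic is expected to travel as ESP (IP protocol 50) rather than as plain UDP on the data port, a capture limited to port 7789 may miss it. The sketch below builds a filter that covers both cases; the function name is hypothetical and the IPs are placeholders:

```shell
#!/bin/sh
# Hypothetical helper: build a tcpdump filter for overlay traffic between two nodes.
# $1 = source IP, $2 = destination IP, $3 = overlay data path port.
# Matches plain VXLAN (UDP on the data port) or ESP (IP protocol 50).
capture_filter() {
  echo "src $1 and dst $2 and (udp port $3 or ip proto 50)"
}

# Usage on node 3, one command per console:
#   sudo tcpdump -n "$(capture_filter NODE_1_IP NODE_3_IP 7789)"
#   sudo tcpdump -n "$(capture_filter NODE_3_IP NODE_1_IP 7789)"
```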

Update 2

One area of config I failed to mention is that I have the following /etc/docker/daemon.json setup:

{
    "hosts": ["unix:///var/run/docker.sock", "tcp://0.0.0.0:2376"],
    "tlscacert": "/var/docker/ca.pem",
    "tlscert": "/var/docker/server-cert.pem",
    "tlskey": "/var/docker/server-key.pem",
    "tlsverify": true
}

This is to allow me to use TLS client certificates to connect remotely. This file was set up on all nodes before I created the swarm.

As decryption of the packets looks like a possible cause, I have changed my daemon.json to the following:

{
    "hosts": []
}

I then rebooted each machine. The test results are the same - still not working.

I then ran the command: docker swarm ca --rotate and re-ran the tests. This has the same result.

I have not removed and re-initialised the cluster with the new config. (I could do so if someone thinks it would help, but I have a lot of docker secrets and configs which I would lose in the process.)

Update 3

I have now completely removed and re-initialised the cluster. This has not worked.

Some sources say that the following command should show traffic when run on the nodes:

sudo tcpdump -p esp

I have run this on all nodes in the cluster and repeated all tests, and there is no output anywhere.
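
Another check along the same lines is whether the kernel has any IPsec security associations at all, since an encrypted overlay network relies on them between nodes. This is a rough sketch: `ip xfrm state` is a real iproute2 command, but the helper name is hypothetical and the parsing assumes each association starts with an unindented "src ... dst ..." line:

```shell
#!/bin/sh
# Hypothetical helper: count IPsec security associations in `ip xfrm state`
# output read from stdin. Assumes each SA begins with an unindented
# "src <ip> dst <ip>" line; indented lines (including "sel src ...") are ignored.
count_xfrm_sas() {
  grep -c '^src ' || true   # grep exits non-zero on zero matches; still prints 0
}

# Usage on each node:
#   sudo ip xfrm state | count_xfrm_sas
```

A count of 0 on a node that hosts a task on the encrypted network would suggest the associations were never set up, rather than that packets are being dropped in transit.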

ufw is inactive on all the nodes:

robert@metcaac6:/var/log$ sudo ufw status
[sudo] password for robert: 
Status: inactive

but when I run iptables -L I get the same rules on every node:

robert@metcaac6:/var/log$ sudo iptables -L

Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
ACCEPT     udp  --  anywhere             anywhere             policy match dir in pol ipsec udp dpt:7789 u32 "0x0>>0x16&0x3c@0xc&0xffffff00=0x100300"
DROP       udp  --  anywhere             anywhere             udp dpt:7789 u32 "0x0>>0x16&0x3c@0xc&0xffffff00=0x100300"

Chain FORWARD (policy DROP)
target     prot opt source               destination         
DOCKER-USER  all  --  anywhere             anywhere            
DOCKER-INGRESS  all  --  anywhere             anywhere            
DOCKER-ISOLATION-STAGE-1  all  --  anywhere             anywhere            
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
DOCKER     all  --  anywhere             anywhere            
ACCEPT     all  --  anywhere             anywhere            
ACCEPT     all  --  anywhere             anywhere            
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
DOCKER     all  --  anywhere             anywhere            
ACCEPT     all  --  anywhere             anywhere            
DROP       all  --  anywhere             anywhere            

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         

Chain DOCKER (2 references)
target     prot opt source               destination         

Chain DOCKER-INGRESS (1 references)
target     prot opt source               destination         
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:https
ACCEPT     tcp  --  anywhere             anywhere             state RELATED,ESTABLISHED tcp spt:https
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:http
ACCEPT     tcp  --  anywhere             anywhere             state RELATED,ESTABLISHED tcp spt:http
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:30001
ACCEPT     tcp  --  anywhere             anywhere             state RELATED,ESTABLISHED tcp spt:30001
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:30000
ACCEPT     tcp  --  anywhere             anywhere             state RELATED,ESTABLISHED tcp spt:30000
RETURN     all  --  anywhere             anywhere            

Chain DOCKER-ISOLATION-STAGE-1 (1 references)
target     prot opt source               destination         
DOCKER-ISOLATION-STAGE-2  all  --  anywhere             anywhere            
DOCKER-ISOLATION-STAGE-2  all  --  anywhere             anywhere            
RETURN     all  --  anywhere             anywhere            

Chain DOCKER-ISOLATION-STAGE-2 (2 references)
target     prot opt source               destination         
DROP       all  --  anywhere             anywhere            
DROP       all  --  anywhere             anywhere            
RETURN     all  --  anywhere             anywhere            

Chain DOCKER-USER (1 references)
target     prot opt source               destination         
RETURN     all  --  anywhere             anywhere     
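
The u32 expression in the two INPUT rules can be decoded by hand. As far as I can tell, `0x0>>0x16&0x3c@` skips the IP header, and `0xc&0xffffff00` then reads the 4 bytes at offset 12, which is past the 8-byte UDP header and 4 bytes into the VXLAN header, where the 24-bit network ID (VNI) sits in the top three bytes. The small sketch below (helper name hypothetical) extracts the VNI from the compared value:

```shell
#!/bin/sh
# Hypothetical helper: recover the VXLAN network ID (VNI) from the value a
# u32 rule compares against after masking with 0xffffff00.
# The VNI occupies the upper 24 bits, so shift the low reserved byte away.
vni_from_u32() {
  echo $(( $1 >> 8 ))
}

vni_from_u32 0x100300   # prints 4099
```

So the first rule accepts VXLAN packets for network 4099 only when they arrived through an IPsec policy, and the second drops packets for the same network that arrived unencrypted. The ID should correspond to the vxlanid_list option shown by docker network inspect for the encrypted network.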

I have inspected dmesg and /var/log/syslog looking for possible issues but I can't find any.

I still have no idea where I should be looking for the issue.

Robert3452
  • Try running the following on each node: `sudo iptables -I INPUT -p 50 -j ACCEPT` – BMitch Jan 27 '20 at 13:55
  • I did this and verified the rule was added. This had no effect. I also added it to INPUT and FORWARD chains and again there was no effect. – Robert3452 Jan 27 '20 at 14:39
  • If I understand correctly, the first two INPUT rules will receive a UDP packet and, if it matches some ipsec thing, pass it to the ipsec module. The second rule drops the packets that were sent to ipsec. There is no protocol 50 rule there, should there be? – Robert3452 Jan 27 '20 at 15:50
