I have a 4 node Kubernetes cluster, 1 x controller and 3 x workers. The following shows how they are configured with the versions.
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8s-ctrl-1 Ready master 1h v1.11.2 192.168.191.100 <none> Ubuntu 18.04.1 LTS 4.15.0-1021-aws docker://18.6.1
turtle-host-01 Ready <none> 1h v1.11.2 192.168.191.53 <none> Ubuntu 18.04.1 LTS 4.15.0-29-generic docker://18.6.1
turtle-host-02 Ready <none> 1h v1.11.2 192.168.191.2 <none> Ubuntu 18.04.1 LTS 4.15.0-34-generic docker://18.6.1
turtle-host-03 Ready <none> 1h v1.11.2 192.168.191.3 <none> Ubuntu 18.04.1 LTS 4.15.0-33-generic docker://18.6.1
Each of the nodes has two network interfaces, for arguments sake eth0
and eth1
. eth1
is the network that I want to the cluster to work on. I setup the controller using kubeadm init
and passed --api-advertise-address 192.168.191.100
. The worker nodes where then joined using this address.
Finally on each node I modified the kubelet service to have --node-ip
set so that the layout looks as above.
The cluster appears to be working correctly and I can create pods, deployments etc. However the issue I have is that none of the pods are able to use the kube-dns
service for DNS resolution.
This is not a problem with resolution, rather that the machines cannot connect to the DNS service to perform the resolution. For example if I run a busybox
container and access it to perform nslookup
i get the following:
/ # nslookup www.google.co.uk
nslookup: read: Connection refused
nslookup: write to '10.96.0.10': Connection refused
I have a feeling that this is down to not using the default network and because of that I suspect some Iptables rules are not correct, that being said these are just guesses.
I have tried both the Flannel overlay and now Weave net. The pod CIDR range is 10.32.0.0/16
and the service CIDR is as default.
I have noticed that with Kubernetes 1.11 there are now pods called coredns
rather than one kube-dns
.
I hope that this is a good place to ask this question. I am sure I am missing something small but vital so if anyone has any ideas that would be most welcome.
Update #1:
I should have said that the nodes are not all in the same place. I have a VPN running between them all and this is the network I want things to communicate over. It is an idea I had to try and have distributed nodes.
Update #2:
I saw another answer on SO (DNS in Kubernetes not working) that suggested kubelet
needed to have --cluster-dns
and --cluster-domain
set. This is indeed the case on my DEV K8s cluster that I have running at home (on one network).
However it is not the case on this cluster and I suspect this is down to a later version. I did add the two settings to all nodes in the cluster, but it did not make things work.
Update #3
The topology of the cluster is as follows.
- 1 x Controller is in AWS
- 1 x Worker is in Azure
- 2 x Worker are physical machines in a colo Data Centre
All machines are connected to each other using ZeroTier VPN on the 192.168.191.0/24 network.
I have not configured any special routing. I agree that this is probably where the issue is, but I am not 100% sure what this routing should be.
WRT to kube-dns
and nginx
, I have not tainted my controller so nginx
is not on the master, not is busybox
. nginx
and busybox
are on workers 1 and 2 respectively.
I have used netcat
to test connection to kube-dns
and I get the following:
/ # nc -vv 10.96.0.10 53
nc: 10.96.0.10 (10.96.0.10:53): Connection refused
sent 0, rcvd 0
/ # nc -uvv 10.96.0.10 53
10.96.0.10 (10.96.0.10:53) open
The UDP connection does not complete.
I modified my setup so that I could run containers on the controller, so kube-dns
, nginx
and busybox
are all on the controller, and I am able to connect and resolve DNS queries against 10.96.0.10.
So all this does point to routing or IPTables IMHO, I just need to work out what that should be.
Update #4
In response to comments I can confirm the following ping test results.
Master -> Azure Worker (Internet) : SUCCESS : Traceroute SUCCESS
Master -> Azure Worker (VPN) : SUCCESS : Traceroute SUCCESS
Azure Worker -> Master (Internet) : SUCCESS : Traceroute FAIL (too many hops)
Azure Worker -> Master (VPN) : SUCCESS : Traceroute SUCCESS
Master -> Colo Worker 1 (Internet) : SUCCESS : Traceroute SUCCESS
Master -> Colo Worker 1 (VPN) : SUCCESS : Traceroute SUCCESS
Colo Worker 1 -> Master (Internet) : SUCCESS : Traceroute FAIL (too many hops)
Colo Worker 1 -> Master (VPN) : SUCCESS : Traceroute SUCCESS
Update 5
After running the tests above, it got me thinking about routing and I wondered if it was as simple as providing a route to the controller over the VPN for the service CIDR range (10.96.0.0/12
).
So on a host, not included in the cluster, I added a route thus:
route add -net 10.96.0.0/12 gw 192.168.191.100
And I could then resolve DNS using the kube-dns
server address:
nslookup www.google.co.uk 10.96.0.10
SO I then added a route, as above, to one of the worker nodes and tried the same. But it is blocked and I do not get a response. Given that I can resolve DNS over the VPN with the appropriate route from a non-kubernetes machine, I can only think that there is an IPTables rule that needs updating or adding.
I think this is almost there, just one last bit to fix.
I realise this is wrong as it it the kube-proxy
should do the DNS resolution on each host. I am leaving it here for information.