
I have a four-node Kubernetes cluster: one controller and three workers. The following shows how they are configured, along with the versions.

```
NAME             STATUS   ROLES    AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
k8s-ctrl-1       Ready    master   1h    v1.11.2   192.168.191.100   <none>        Ubuntu 18.04.1 LTS   4.15.0-1021-aws     docker://18.6.1
turtle-host-01   Ready    <none>   1h    v1.11.2   192.168.191.53    <none>        Ubuntu 18.04.1 LTS   4.15.0-29-generic   docker://18.6.1
turtle-host-02   Ready    <none>   1h    v1.11.2   192.168.191.2    <none>        Ubuntu 18.04.1 LTS   4.15.0-34-generic   docker://18.6.1
turtle-host-03   Ready    <none>   1h    v1.11.2   192.168.191.3    <none>        Ubuntu 18.04.1 LTS   4.15.0-33-generic   docker://18.6.1
```

Each of the nodes has two network interfaces, say eth0 and eth1. eth1 is the network that I want the cluster to work on. I set up the controller using kubeadm init and passed --apiserver-advertise-address 192.168.191.100. The worker nodes were then joined using this address.
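In concrete terms, the bootstrap looked roughly like this (the token and hash are placeholders; flag names as in kubeadm 1.11, and the pod CIDR is the 10.32.0.0/16 range mentioned below):

```shell
# On the controller: advertise the API server on the VPN address.
sudo kubeadm init --apiserver-advertise-address 192.168.191.100 \
                  --pod-network-cidr 10.32.0.0/16

# On each worker: join via the controller's VPN address.
sudo kubeadm join 192.168.191.100:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash>
```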

Finally, on each node I modified the kubelet service to set --node-ip, so that the layout looks as above.
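On Ubuntu with kubeadm 1.11, a sketch of that kubelet change (the IP differs per node; /etc/default/kubelet is the path the kubeadm deb packages read):

```shell
# Hypothetical example for turtle-host-01; use each node's own VPN IP.
echo 'KUBELET_EXTRA_ARGS=--node-ip=192.168.191.53' | sudo tee /etc/default/kubelet
sudo systemctl daemon-reload
sudo systemctl restart kubelet
```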

The cluster appears to be working correctly and I can create pods, deployments, etc. However, none of the pods are able to use the kube-dns service for DNS resolution.

This is not a problem with resolution itself; rather, the machines cannot connect to the DNS service to perform the resolution. For example, if I run a busybox container and exec into it to perform an nslookup, I get the following:

```
/ # nslookup www.google.co.uk
nslookup: read: Connection refused
nslookup: write to '10.96.0.10': Connection refused
```

I have a feeling that this is down to not using the default network, and because of that I suspect some iptables rules are not correct; that said, these are just guesses.

I have tried both the Flannel overlay and now Weave net. The pod CIDR range is 10.32.0.0/16 and the service CIDR is as default.

I have noticed that with Kubernetes 1.11 there are now pods called coredns rather than a single kube-dns pod.

I hope that this is a good place to ask this question. I am sure I am missing something small but vital so if anyone has any ideas that would be most welcome.

Update #1:

I should have said that the nodes are not all in the same place. I have a VPN running between them all, and this is the network I want things to communicate over. The idea is to experiment with distributed nodes.

Update #2:

I saw another answer on SO (DNS in Kubernetes not working) that suggested kubelet needed to have --cluster-dns and --cluster-domain set. This is indeed the case on my DEV K8s cluster that I have running at home (on one network).

However, it is not the case on this cluster, and I suspect this is down to the later version. I did add the two settings to all nodes in the cluster, but it did not make things work.
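For reference, the two flags were of this shape (10.96.0.10 being the DNS Service IP under the default service CIDR):

```shell
# Appended to the kubelet arguments on every node; it did not help here.
--cluster-dns=10.96.0.10 --cluster-domain=cluster.local
```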

Update #3

The topology of the cluster is as follows.

  • 1 x Controller is in AWS
  • 1 x Worker is in Azure
  • 2 x Worker are physical machines in a colo Data Centre

All machines are connected to each other using ZeroTier VPN on the 192.168.191.0/24 network.

I have not configured any special routing. I agree that this is probably where the issue is, but I am not 100% sure what this routing should be.

WRT kube-dns and nginx: I have not removed the taint from my controller, so nginx is not on the master, nor is busybox. nginx and busybox are on workers 1 and 2 respectively.

I have used netcat to test connection to kube-dns and I get the following:

```
/ # nc -vv 10.96.0.10 53
nc: 10.96.0.10 (10.96.0.10:53): Connection refused
sent 0, rcvd 0
/ # nc -uvv 10.96.0.10 53
10.96.0.10 (10.96.0.10:53) open
```

The UDP connection does not complete.

I modified my setup so that I could run containers on the controller. With kube-dns, nginx and busybox all on the controller, I am able to connect and resolve DNS queries against 10.96.0.10.

So all of this does point to routing or iptables, IMHO; I just need to work out what the correct rules should be.
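A sketch of the checks I would run on a worker to narrow this down (the chain names are those kube-proxy creates in iptables mode; the grep patterns are mine):

```shell
# The kube-dns ClusterIP should have an entry in the KUBE-SERVICES NAT chain.
sudo iptables -t nat -L KUBE-SERVICES -n | grep 10.96.0.10

# Docker 17.06+ sets the FORWARD policy to DROP, which can break
# cross-node pod/service traffic if nothing re-allows it.
sudo iptables -L FORWARD -n --line-numbers | head -5
```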

Update #4

In response to comments I can confirm the following ping test results.

```
Master -> Azure Worker (Internet)  : SUCCESS : Traceroute SUCCESS
Master -> Azure Worker (VPN)       : SUCCESS : Traceroute SUCCESS
Azure Worker -> Master (Internet)  : SUCCESS : Traceroute FAIL (too many hops)
Azure Worker -> Master (VPN)       : SUCCESS : Traceroute SUCCESS

Master -> Colo Worker 1 (Internet) : SUCCESS : Traceroute SUCCESS
Master -> Colo Worker 1 (VPN)      : SUCCESS : Traceroute SUCCESS
Colo Worker 1 -> Master (Internet) : SUCCESS : Traceroute FAIL (too many hops)
Colo Worker 1 -> Master (VPN)      : SUCCESS : Traceroute SUCCESS
```

Update #5

After running the tests above, it got me thinking about routing and I wondered if it was as simple as providing a route to the controller over the VPN for the service CIDR range (10.96.0.0/12).
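As a sanity check, the kube-dns ClusterIP does fall inside that range, so a single route covers it; a quick check using python3 from the shell:

```shell
# Confirm 10.96.0.10 is inside the default service CIDR 10.96.0.0/12.
python3 -c "import ipaddress; print(ipaddress.ip_address('10.96.0.10') in ipaddress.ip_network('10.96.0.0/12'))"
# prints: True
```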

So on a host, not included in the cluster, I added a route thus:

```shell
route add -net 10.96.0.0/12 gw 192.168.191.100
```

And I could then resolve DNS using the kube-dns server address:

```shell
nslookup www.google.co.uk 10.96.0.10
```

So I then added the same route to one of the worker nodes and tried again. This time it is blocked and I do not get a response. Given that I can resolve DNS over the VPN, with the appropriate route, from a non-Kubernetes machine, I can only think that there is an iptables rule that needs updating or adding.
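One way to see where the packets die is to capture on both ends (the ZeroTier interface name is a placeholder; note that kube-proxy DNATs the Service IP, so on the controller the traffic shows up addressed to the DNS pod's IP):

```shell
# On the worker: watch DNS queries to the service IP leave over the VPN
# (replace ztXXXXXXXX with the actual ZeroTier interface name).
sudo tcpdump -ni ztXXXXXXXX host 10.96.0.10 and port 53

# On the controller: see whether the queries arrive and get answered.
sudo tcpdump -ni any udp port 53
```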

I think this is almost there, just one last bit to fix.

I realise this approach is wrong, as kube-proxy on each host should handle traffic to the Service IP. I am leaving it here for information.

Russell Seymour
  • What's the output of `kubectl exec -ti busybox -- nslookup kubernetes.default`? – Nicola Ben Sep 11 '18 at 16:54
  • @NicolaBen The output of this is `nslookup: write to '10.96.0.10': Connection refused`. This is why I think it is a routing or FW issue preventing connection. – Russell Seymour Sep 11 '18 at 18:00
  • > I should have said that the nodes are not all in the same place. What do you mean? Are they in different regions? Different VPCs? What kind of subnets are seen on the hosts? How is the VPN configured? – leodotcloud Sep 11 '18 at 18:43
  • @leodotcloud My apologies for not adding all that in. I have now updated my post. – Russell Seymour Sep 12 '18 at 07:25

2 Answers


Following the instructions at this page, try running this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  namespace: default
  name: dns-example
spec:
  containers:
    - name: test
      image: nginx
  dnsPolicy: "None"
  dnsConfig:
    nameservers:
      - 1.2.3.4
    searches:
      - ns1.svc.cluster.local
      - my.dns.search.suffix
    options:
      - name: ndots
        value: "2"
      - name: edns0
```
and see whether a manual DNS configuration works, or whether you have a networking problem reaching DNS.

Nicola Ben
  • Thanks for this. I do have external DNS access when I use a similar configuration to this. The problem is that I do not have access to the Kube DNS service within the cluster so I cannot resolve services. – Russell Seymour Sep 11 '18 at 17:23

Sounds like you are running on AWS. I suspect that your AWS Security Group is not allowing DNS traffic through. You can try allowing all traffic within the Security Group(s) where your master and nodes are, to see if that is the problem.


You can also check that all your masters and nodes have IP forwarding enabled:

```shell
cat /proc/sys/net/ipv4/ip_forward
```

If not:

```shell
echo 1 > /proc/sys/net/ipv4/ip_forward
```
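To make that persist across reboots, something along these lines (the sysctl.d path is conventional on Ubuntu 18.04):

```shell
echo 'net.ipv4.ip_forward = 1' | sudo tee /etc/sysctl.d/99-ip-forward.conf
sudo sysctl --system
```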

Hope it helps.

Rico
  • Thanks for the suggestion, but alas it does not work. You are correct that the controller is in AWS. The other machines are not in AWS but connected via a VPN - this is an idea I am playing about with. – Russell Seymour Sep 11 '18 at 17:20
  • is that problem related only to connectivity to dns service or to services communication at all? Can you just create your own pod with some hello world web service, attach kubernetes service to it and create another container to query that service? I know that you have problem with DNS resolution but every service has own IP so to query just use service's IP. Did you use Flannel with VXLAN ? – Jakub Bujny Sep 11 '18 at 17:54
  • @JakubBujny Yes if I deploy a pod (nginx) and a service for it I am able to contact that service from my `busybox` pod using the Service IP address. No I did not use Flannel with VXLAN, I will do some research on that now. – Russell Seymour Sep 11 '18 at 18:20
  • k8s does a lot of iptables manipulation so I wouldn't be surprised if it's related to that combined with your cloud firewall rules. – Rico Sep 11 '18 at 18:26
  • That's great - could you please check the same using kube-dns service's IP and telnet on port 53? Are your busybox on node and kube-dns on master? That's important. Please provide a deeper description of the network, I mean: where is master? Where is nginx? Where is busybox? Is that all on-prem or AWS or hybrid? Which VPN are you using and how have you configured routing? – Jakub Bujny Sep 11 '18 at 18:30
  • @JakubBujny I have updated my question with all this information. Thanks for looking. – Russell Seymour Sep 12 '18 at 07:25
  • Ok please leave kubernetes and go onto VM level using SSH. Please open firewalls/security groups for ICMP and test connections using pings and traceroute in following scenarios (controller=master): `master <-> worker in Azure`, `master <-> worker in colo Data Center` – Jakub Bujny Sep 12 '18 at 07:36
  • @JakubBujny I have added the test results to my original post. – Russell Seymour Sep 12 '18 at 12:19