I am trying to get Ansible AWX installed on my Kubernetes cluster, but the RabbitMQ container is throwing a "Failed to get nodes from k8s" error.

Below are the versions of the platforms I am using:

[node1 ~]# kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.5", 
GitCommit:"f01a2bf98249a4db383560443a59bed0c13575df", GitTreeState:"clean", 
BuildDate:"2018-03-19T15:50:45Z", GoVersion:"go1.9.3", Compiler:"gc", 
Platform:"linux/amd64"}

Kubernetes is deployed via the kubespray playbook v2.5.0, and all the services and pods (CoreDNS, Weave, iptables) are up and running.

I am deploying AWX via the 1.0.6 release using the 1.0.6 images for awx_web and awx_task.

I am using an external PostgreSQL database at v10.4 and have verified that the tables are being created by awx in the DB.

Troubleshooting steps I have tried:

  • I tried to deploy AWX 1.0.5 with the etcd pod to the same cluster and it worked as expected.
  • I have deployed a standalone RabbitMQ cluster in the same k8s cluster, mimicking the AWX RabbitMQ deployment as closely as possible, and it works with the rabbit_peer_discovery_k8s backend.
  • I have tried tweaking some of the rabbitmq.conf settings for AWX 1.0.6 (the relevant settings are sketched after this list) with no luck; it just keeps throwing the same error.
  • I have verified the /etc/resolv.conf file has the kubernetes.default.svc.cluster.local entry.
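
For reference, the peer discovery settings I have been tweaking look roughly like this. This is a minimal sketch of a typical rabbit_peer_discovery_k8s configuration, not the exact rabbitmq.conf shipped with AWX 1.0.6:

cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
# host/port of the Kubernetes API; this in-cluster DNS name is the plugin default
cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
cluster_formation.k8s.port = 443
# return pod IPs rather than hostnames (matches the rabbit@10.233.x.x node name in the log below)
cluster_formation.k8s.address_type = ip
# name of the Service used for discovery
cluster_formation.k8s.service_name = rabbitmq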

Cluster Info

[node1 ~]# kubectl get all -n awx
NAME         DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/awx   1         1         1            0           38m

NAME                DESIRED   CURRENT   READY     AGE
rs/awx-654f7fc84c   1         1         0         38m

NAME                      READY     STATUS             RESTARTS   AGE
po/awx-654f7fc84c-9ppqb   3/4       CrashLoopBackOff   11         38m

NAME               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                          AGE
svc/awx-rmq-mgmt   ClusterIP   10.233.10.146   <none>        15672/TCP                        1d
svc/awx-web-svc    NodePort    10.233.3.75     <none>        80:31700/TCP                     1d
svc/rabbitmq       NodePort    10.233.37.33    <none>        15672:30434/TCP,5672:31962/TCP   1d

AWX RabbitMQ error log

[node1 ~]# kubectl logs -n awx awx-654f7fc84c-9ppqb awx-rabbit
2018-07-09 14:47:37.464 [info] <0.33.0> Application lager started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.767 [info] <0.33.0> Application os_mon started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.767 [info] <0.33.0> Application crypto started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.768 [info] <0.33.0> Application cowlib started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.768 [info] <0.33.0> Application xmerl started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.851 [info] <0.33.0> Application mnesia started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.851 [info] <0.33.0> Application recon started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application jsx started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application asn1 started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application public_key started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.897 [info] <0.33.0> Application ssl started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application ranch started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application ranch_proxy_protocol started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application rabbit_common started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.907 [info] <0.33.0> Application amqp_client started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.909 [info] <0.33.0> Application cowboy started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.957 [info] <0.33.0> Application inets started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.964 [info] <0.193.0>
 Starting RabbitMQ 3.7.4 on Erlang 20.1.7
 Copyright (C) 2007-2018 Pivotal Software, Inc.
 Licensed under the MPL.  See http://www.rabbitmq.com/

  ##  ##
  ##  ##      RabbitMQ 3.7.4. Copyright (C) 2007-2018 Pivotal Software, Inc.
  ##########  Licensed under the MPL.  See http://www.rabbitmq.com/
  ######  ##
  ##########  Logs: <stdout>

              Starting broker...
2018-07-09 14:47:37.982 [info] <0.193.0>
 node           : rabbit@10.233.120.5
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.conf
 cookie hash    : at619UOZzsenF44tSK3ulA==
 log(s)         : <stdout>
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@10.233.120.5
2018-07-09 14:47:39.649 [info] <0.201.0> Memory high watermark set to 11998 MiB (12581714329 bytes) of 29997 MiB (31454285824 bytes) total
2018-07-09 14:47:39.652 [info] <0.203.0> Enabling free disk space monitoring
2018-07-09 14:47:39.653 [info] <0.203.0> Disk free limit set to 50MB
2018-07-09 14:47:39.658 [info] <0.205.0> Limiting to approx 1048476 file handles (943626 sockets)
2018-07-09 14:47:39.658 [info] <0.206.0> FHC read buffering:  OFF
2018-07-09 14:47:39.658 [info] <0.206.0> FHC write buffering: ON
2018-07-09 14:47:39.660 [info] <0.193.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@10.233.120.5 is empty. Assuming we need to join an existing cluster or initialise from scratch...
2018-07-09 14:47:39.660 [info] <0.193.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2018-07-09 14:47:39.660 [info] <0.193.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2018-07-09 14:47:39.660 [info] <0.193.0> Peer discovery backend does not support locking, falling back to randomized delay
2018-07-09 14:47:39.660 [info] <0.193.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2018-07-09 14:47:39.665 [info] <0.193.0> Failed to get nodes from k8s - {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},
                 {inet,[inet],nxdomain}]}
2018-07-09 14:47:39.665 [error] <0.192.0> CRASH REPORT Process <0.192.0> with 0 neighbours exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n                 {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164 in application_master:init/4 line 134
2018-07-09 14:47:39.666 [info] <0.33.0> Application rabbit exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n                 {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,\"{failed_connect,[{to_address,{\\"kubernetes.default.svc.cluster.local\\",443}},\n                 {inet,[inet],nxdomain}]}\"}},[{rabbit_mnesia,init_from_config,0,[{file,\"src/rabbit_mnesia.erl\"},{line,164}]},{rabbit_mnesia,init_with_lock,3,[{file,\"src/rabbit_mnesia.erl\"},{line,144}]},{rabbit_mnesia,init,0,[{file,\"src/rabbit_mnesia.erl\"},{line,111}]},{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,run_step,2,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,'-run_boot_steps/1-lc$^0/1-0-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit_boot_steps,run_boot_steps,1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit,start,2,[{file,\"src/rabbit.erl\"},{line,793}]}]}}}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,"{failed_connect,[{to_address,{\"kubernetes.defau

Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done

Kubernetes API service

[node1 ~]# kubectl describe service kubernetes
Name:              kubernetes
Namespace:         default
Labels:            component=apiserver
                provider=kubernetes
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP:                10.233.0.1
Port:              https  443/TCP
TargetPort:        6443/TCP
Endpoints:         10.237.34.19:6443,10.237.34.21:6443
Session Affinity:  ClientIP
Events:            <none>

nslookup from a busybox pod in the same Kubernetes cluster

[node2 ~]# kubectl exec -it busybox -- sh
/ # nslookup  kubernetes.default.svc.cluster.local
Server:    10.233.0.3
Address 1: 10.233.0.3 coredns.kube-system.svc.cluster.local

Name:      kubernetes.default.svc.cluster.local
Address 1: 10.233.0.1 kubernetes.default.svc.cluster.local

Please let me know if I am missing anything that could help with troubleshooting.

kaylor
  • _I have verified the /etc/resolv.conf file has the kubernetes.default.svc.cluster.local entry_ what does that mean? I wouldn't expect the hostname to be listed in `/etc/resolv.conf` – mdaniel Jul 10 '18 at 04:59

1 Answer

I believe the solution is to omit the explicit kubernetes host. I can't think of any good reason one would need to specify the Kubernetes API host from inside the cluster.

If for some terrible reason the RMQ plugin requires it, then try swapping in the Service IP (assuming your SSL cert for the master has its Service IP in the SANs list).
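
If you do need to set it, a minimal sketch using the values from the kubectl describe service kubernetes output above (Service IP 10.233.0.1, port 443; substitute your own cluster's values):

cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
# point at the kubernetes Service VIP, not an individual apiserver Endpoint
cluster_formation.k8s.host = 10.233.0.1
cluster_formation.k8s.port = 443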


As for why it is doing such a silly thing, the only good reason I can think of is that the RMQ PodSpec has somehow gotten a dnsPolicy of something other than ClusterFirst. If you truly wish to troubleshoot the RMQ Pod, then you can provide an explicit command: to run some debugging bash commands first, in order to interrogate the state of the container at launch, and then exec /launch.sh to resume booting up RMQ (as they do); a sketch of such an override is below.
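
For example, a hypothetical override of that sort in the Deployment's PodSpec. The container name awx-rabbit comes from the kubectl logs command in the question, the dnsPolicy line and debug commands are illustrative, and /launch.sh is the image's own entrypoint:

spec:
  template:
    spec:
      dnsPolicy: ClusterFirst   # the default; set explicitly to rule out an override
      containers:
      - name: awx-rabbit
        command:
        - /bin/sh
        - -c
        - |
          # dump DNS state at launch, then hand off to the normal entrypoint
          cat /etc/resolv.conf
          nslookup kubernetes.default.svc.cluster.local || true
          exec /launch.sh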

mdaniel
  • Thank you for the help. I ended up changing the awx_rabbit deployment config to `#cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s cluster_formation.k8s.host = 10.237.34.19 cluster_formation.k8s.port = 6443` However I am still not understanding why the awx rabbit pod is unable to resolve kubernetes.default.svc.cluster.local but the test-rabbit pods can. This is ultimately a SPOF in my cluster, given that if the master API node goes down it does not have a backup like it would pointing to the kubernetes service. – kaylor Jul 10 '18 at 14:39
  • sure, that's why you'd use the `Service` IP like I said: `10.233.0.1` with the `Service` port of `443`, not the individual `Endpoint` IPs and their port numbers – mdaniel Jul 11 '18 at 04:19
  • feel free to post `kubectl -n awx get -o yaml pod -l app=rabbitmq` as a pastebin or something, and we can see if there is something out of alignment with the awx RMQ specifically – mdaniel Jul 11 '18 at 04:23
  • I ended up getting it to work by pointing it to `cluster_formation.k8s.host = 10.233.0.1` and `cluster_formation.k8s.port = 443` like you suggested. It still is unable to resolve the kubernetes.default.svc.cluster.local DNS entry, but I believe it may be something related to my CNI and DNS setup. Thank you for your help! – kaylor Jul 11 '18 at 21:49
  • Still getting the same error with awx 1.0.7.2. I am thinking it's a proxy issue now, maybe? – kaylor Aug 24 '18 at 18:44
  • I don't understand your question; are you using a proxy in your Pods? and is the "proxy issue" because you have used `.host = 10.233.0.1` and it is still giving you problems or something? At this point, you might be better served opening a new question with the actual behavior, rather than trying to pivot this question – mdaniel Aug 24 '18 at 20:08
  • I went ahead and opened an issue with awx, since this potentially seems to affect only the awx pod. https://github.com/ansible/awx/issues/2201 – kaylor Aug 27 '18 at 16:20