
I run Kafka inside a Kubernetes cluster on VMware, with one control-plane node and one worker node. From the control-plane node my client can communicate with Kafka, but from the worker node it fails with this error:

   %3|1638529687.405|FAIL|apollo-prototype-765f4d8bcf-bjpf4#producer-2| [thrd:sasl_plaintext://my-cluster-kafka-bootstrap:9092/bootstrap]: sasl_plaintext://my-cluster-kafka-bootstrap:9092/bootstrap: Failed to resolve 'my-cluster-kafka-bootstrap:9092': Temporary failure in name resolution (after 20016ms in state CONNECT, 2 identical error(s) suppressed)
   %3|1638529687.406|ERROR|apollo-prototype-765f4d8bcf-bjpf4#producer-2| [thrd:app]: apollo-prototype-765f4d8bcf-bjpf4#producer-2: sasl_plaintext://my-cluster-kafka-bootstrap:9092/bootstrap: Failed to resolve 'my-cluster-kafka-bootstrap:9092': Temporary failure in name resolution (after 20016ms in state CONNECT, 2 identical error(s) suppressed)
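
The client is failing to resolve the short service name my-cluster-kafka-bootstrap, which relies on the pod's DNS (CoreDNS) being reachable from the node the pod runs on. As a first sanity check it is worth confirming that the bootstrap Service exists and noting its fully qualified in-cluster name; a minimal sketch, assuming the cluster was deployed into the default namespace:

   kubectl get svc my-cluster-kafka-bootstrap -n default
   # fully qualified form of the same name:
   #   my-cluster-kafka-bootstrap.default.svc.cluster.local:9092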

This is my Kafka cluster manifest (using Strimzi)

listeners:
  - name: plain
    port: 9092
    type: internal
    tls: false
    authentication:
      type: scram-sha-512
  - name: external
    port: 9094
    type: ingress
    tls: true
    authentication:
      type: scram-sha-512
    configuration:
      class: nginx
      bootstrap:
        host: localb.kafka.xxx.com
      brokers:
      - broker: 0
        host: local.kafka.xxx.com
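
The error output above looks like it comes from a librdkafka-based client, so for reference a client pointed at the plain listener from inside the cluster would use settings roughly like the following. This is only a sketch: the KafkaUser name my-user is a placeholder and the default namespace is an assumption.

   bootstrap.servers=my-cluster-kafka-bootstrap.default.svc.cluster.local:9092
   security.protocol=SASL_PLAINTEXT
   sasl.mechanisms=SCRAM-SHA-512
   # placeholder credentials, taken from the KafkaUser secret in a real setup
   sasl.username=my-user
   sasl.password=<password-from-the-my-user-secret>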

Worth mentioning: exactly the same configuration works flawlessly when I run it in a cloud environment.

Telnet and nslookup (run from both nodes) also throw an error, the CoreDNS logs do not even mention this error, and the firewall is disabled on both nodes.
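
Since resolution works from the control-plane node but not from the worker, one way to narrow it down is to run a throwaway pod pinned to the worker node and try the lookup from inside the cluster network. A sketch, where k8s-worker-1 is a placeholder for the worker node's name:

   kubectl run dns-test --image=busybox:1.28 --restart=Never \
     --overrides='{"apiVersion":"v1","spec":{"nodeName":"k8s-worker-1"}}' \
     -- nslookup my-cluster-kafka-bootstrap.default.svc.cluster.local
   kubectl logs dns-test
   kubectl delete pod dns-test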

Could you please help me out? Thanks!


UPDATE / SOLUTION: the problem turned out to be Calico on the worker node (the ipip kernel module), not Strimzi or Kafka. See my answer below for the logs and the fix.

  • So, where does the Kafka cluster run? On the worker node from which you cannot connect? Or on some other nodes? I think the Strimzi config looks good - at least the parts shared here. – Jakub Dec 03 '21 at 11:56
  • Kafka runs on the control plane (only one replica) – Oana Dec 03 '21 at 12:06
  • And if I move the Strimzi cluster operator to the worker node (Kafka and ZooKeeper stay on the control-plane node), then the operator logs complain that it cannot connect to ZooKeeper: **2021-12-03 12:39:26 ERROR Util:149 - Reconciliation #1(watch) Kafka(default/my-cluster): Exceeded timeout of 300000ms while waiting for ZooKeeperAdmin connection to my-cluster-zookeeper-0.my-cluster-zookeeper-nodes.default.svc:2181 to be connecte** – Oana Dec 03 '21 at 12:57
  • what is your kubernetes version? – Bazhikov Dec 03 '21 at 15:32
  • _if i move strimzi cluster operator ..._ - that suggests the operator has the same networking / DNS issue as the client. I don't think this is really a Strimzi error but rather an issue with the Kube cluster itself. So maybe it is worth providing some logs from the Kube cluster as well (but I'm not really an expert in this area TBH). – Jakub Dec 03 '21 at 19:35
  • @Jakub indeed, it wasn't Strimzi-related; it seems that Calico was the issue here. Thanks for your intervention :) – Oana Dec 06 '21 at 09:32
  • @Bazhikov, both client and server have v1.21.0, but I was able to understand what the problem was; I will detail this in a comment; thanks :) – Oana Dec 06 '21 at 09:33

1 Answer


The Calico pod on the worker node was complaining that bird: Netlink: Network is down, even though it was not crashing:

2021-12-03 09:39:58.051 [INFO][90] felix/int_dataplane.go 1539: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"tunl0", Addrs:set.mapSet{}}
2021-12-03 09:39:58.051 [INFO][90] felix/hostip_mgr.go 85: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"tunl0", Addrs:set.mapSet{}}
2021-12-03 09:39:58.052 [INFO][90] felix/ipsets.go 130: Queueing IP set for creation family="inet" setID="this-host" setType="hash:ip"
2021-12-03 09:39:58.057 [INFO][90] felix/ipsets.go 785: Doing full IP set rewrite family="inet" numMembersInPendingReplace=3 setID="this-host"
2021-12-03 09:39:58.059 [INFO][90] felix/int_dataplane.go 1036: Linux interface state changed. ifIndex=13 ifaceName="tunl0" state="down"
2021-12-03 09:39:58.082 [INFO][90] felix/int_dataplane.go 1521: Received interface update msg=&intdataplane.ifaceUpdate{Name:"tunl0", State:"down", Index:13}
bird: Netlink: Network is down
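
The felix lines show the tunl0 (IP-in-IP tunnel) interface going down on that node. You can inspect the interface and the Calico pod for that node directly; a sketch, where the kube-system namespace and the calico-node names are the defaults of a manifest-based install and may differ in your setup:

   # on the affected worker node
   ip -d link show tunl0
   # Calico pod running on that node
   kubectl -n kube-system get pods -o wide | grep calico-node
   kubectl -n kube-system logs <calico-node-pod-on-worker> -c calico-node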

Here is what I did, and it worked like a charm!

The fault was caused by the nodes having different kernel modules loaded: I had loaded the ipip module on the new node, but the old node had not loaded it, which caused the Calico exception. Removing the ipip module returned things to normal.

[root@k8s-node236-232 ~]# lsmod  | grep ipip
ipip                   16384  0 
tunnel4                16384  1 ipip
ip_tunnel              24576  1 ipip
[root@k8s-node236-232 ~]# modprobe -r ipip
[root@k8s-node236-232 ~]# lsmod  | grep ipip
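
Note that modprobe -r only unloads the module until something loads it again (or until it is loaded again after a reboot). If unloading it is the right fix in your environment, a minimal sketch for making it stick and restarting Calico on that node (the calico-node pod name is a placeholder and the kube-system namespace is an assumption):

   echo "blacklist ipip" > /etc/modprobe.d/blacklist-ipip.conf
   kubectl -n kube-system delete pod <calico-node-pod-on-worker>
   # then re-test name resolution of my-cluster-kafka-bootstrap from the worker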