I know the information below may not be enough to fully trace the issue, but I would still appreciate some guidance.
We have an Amazon EKS cluster. Currently, we are facing intermittent reachability issues with the Kafka pods.
Environment:
- 10 nodes in total, spread across Availability Zones ap-south-1a and ap-south-1b
- Three-replica Kafka cluster (installed via Helm chart)
- Three-replica ZooKeeper ensemble (installed via Helm chart)
- Kafka uses an external advertised listener on port 19092
- Kafka is exposed through a Service backed by an internal Network Load Balancer (see the inspection sketch after this list)
- A test-pod is deployed to check reachability of the Kafka pods
- We use Cloud Map based DNS for the advertised listener
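A minimal sketch of how I inspect this Service and its endpoints, assuming the chart installed everything into a kafka namespace with a Service named kafka-external (both names are illustrative, not the exact ones used here):

    # Namespace "kafka" and Service "kafka-external" are assumptions from a typical Helm install
    kubectl get svc kafka-external -n kafka -o wide    # NLB-backed Service exposing port 19092
    kubectl get endpoints kafka-external -n kafka      # should list all three broker pod IPs
    kubectl get pods -n kafka -o wide                  # broker pods and the nodes they are scheduled on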
Working:
- When I run the following telnet command from an EC2 instance, it works as expected (10.0.1.45 is the load balancer IP):
    telnet 10.0.1.45 19092
- When I run the following telnet command from an EC2 instance, it also works as expected (10.0.1.69 is an actual node's IP and 31899 is the NodePort):
    telnet 10.0.1.69 31899
Problem:
- When I run the same command from the test-pod:
    telnet 10.0.1.45 19092
  it sometimes works, and sometimes fails with:
    telnet: Unable to connect to remote host: Connection timed out
I suspect the issue is related to kube-proxy, but I need help confirming and resolving it. A sketch of the diagnostics I am considering is below.
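These are the checks I plan to run, assuming the default EKS layout where kube-proxy runs as a DaemonSet labelled k8s-app=kube-proxy in kube-system (the kafka namespace is again an assumption):

    # kube-proxy health and recent logs (default EKS label: k8s-app=kube-proxy)
    kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide
    kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=100

    # Confirm the Service still has healthy endpoints at the moment a timeout occurs
    kubectl get endpoints -n kafka

    # Repeat the test from pods scheduled on different nodes to see if failures follow a node
    kubectl exec -it test-pod -- telnet 10.0.1.45 19092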
Can anyone guide me? Is it safe to restart kube-proxy, and does restarting it affect other pods/deployments?
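If a restart is the right call, this is the approach I am considering, assuming the default EKS setup where kube-proxy runs as a DaemonSet named kube-proxy in kube-system; please correct me if it is unsafe:

    # Rolling restart of kube-proxy; pods are recreated node by node
    kubectl rollout restart daemonset kube-proxy -n kube-system
    kubectl rollout status daemonset kube-proxy -n kube-system

    # Or restart kube-proxy on a single suspect node first (placeholder pod name)
    kubectl delete pod -n kube-system <kube-proxy-pod-on-that-node>

My understanding is that established connections keep working while each node's kube-proxy pod restarts, but I would like confirmation that this does not disrupt other pods or deployments.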