
We are facing a strange issue in only one of our environments (with the same consumer app).

Basically, we observe that lag suddenly starts to build up on only one of the topics on the Kafka broker (the cluster hosts multiple topics); that topic is consumed by 10 consumer members under a single consumer group.

Multiple restarts, adding another pod of the consumer application, and changing default configuration properties (max.poll.records, session.timeout.ms) have NOT helped much so far; an illustrative sketch of the tuning we tried is included after the env details below.

Looking for any suggestions or advice on how to debug this issue (we tried enabling Apache Kafka logs, CloudWatch, etc., but so far we only see that regular/periodic rebalancing is happening, even at a very low load of ~7k messages waiting to be processed).

Below are env details:

  • App - Spring Boot app on version 2.7.2
  • Platform - AWS Kafka (MSK)
  • Kafka Broker - 3 brokers (version 2.8.x)
  • Consumer Group - 1 group with 15 members (8 partitions, 1 topic)
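
For reference, the consumer tuning we tried looks roughly like the sketch below. It is only illustrative: the bootstrap servers placeholder, group id, and the exact values are stand-ins, not our real settings.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;

@Configuration
public class KafkaConsumerTuningConfig {

    @Bean
    public ConsumerFactory<String, String> consumerFactory() {
        Map<String, Object> props = new HashMap<>();
        // Placeholder values; the real app reads these from environment-specific config
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "<msk-bootstrap-servers>");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        // Fewer records per poll so each listener loop finishes well inside max.poll.interval.ms
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);
        // Session timeout tweak we experimented with
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 30000);
        return new DefaultKafkaConsumerFactory<>(props);
    }
}
```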
  • If you have 8 partitions, then you can't have more active instances than that in the group – OneCricketeer Aug 09 '22 at 18:23
  • apologies, it has 15 partitions, 3 brokers, with 10 desired consumers. Now it is rebalancing even with 200 messages waiting to be processed. Seeing "request joining group due to: group is already rebalancing" in spring boot app logs – Arpit S Aug 09 '22 at 21:30
  • Is it possible your instances are having uncaught exceptions (perhaps on one partition), then crashing (possibly with no log output), and being restarted by some other process (k8s pod restart policy)? – OneCricketeer Aug 09 '22 at 21:35
  • you mean an exception in the spring boot app after a message is read by a consumer assigned to a partition? – Arpit S Aug 09 '22 at 21:42
  • Correct. For example, deserialization error, or some other error in your KafkaListener – OneCricketeer Aug 09 '22 at 21:43
  • yet to see any application error in the logs. The messages are getting processed at a very slow rate, and when they do, no errors are seen. Wondering if there is any other logging or anything that can help in further debugging? (see the error-handler sketch after these comments) – Arpit S Aug 09 '22 at 21:47
  • You can try configuring the logger for debug logs, sure. I'm just guessing here... One reason why your app isn't stable would be that k8s is restarting your pods, because they are crashing. Could even be the container dies without logs from OOM, but you should see that in `kubectl describe` event output (exit code 137) – OneCricketeer Aug 09 '22 at 21:50
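
Following up on the comment thread: a rough sketch of how a Spring Kafka error handler could surface listener/deserialization failures that might otherwise be swallowed. The class name, logger, and retry values are illustrative assumptions, not part of our current app.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Configuration
public class KafkaErrorVisibilityConfig {

    private static final Logger log = LoggerFactory.getLogger(KafkaErrorVisibilityConfig.class);

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        // Log every failed record and stop retrying after 3 attempts, so a single
        // poison message cannot stall a partition long enough to trigger a rebalance.
        factory.setCommonErrorHandler(new DefaultErrorHandler(
                (record, ex) -> log.error("Failed record {}-{}@{}",
                        record.topic(), record.partition(), record.offset(), ex),
                new FixedBackOff(1000L, 3L)));
        return factory;
    }
}
```

Turning up the client loggers (for example the `org.apache.kafka.clients.consumer` package) should also show why the group keeps deciding to rebalance.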

0 Answers