This is a high-level consumer polling every 1 second, with a session timeout of 10 seconds and a heartbeat interval of 3 seconds. I was expecting the consumer to reconnect automatically after the session timeout. As I understand it, this is the expected behavior for librdkafka: the consumer can simply call "consume" in a loop, and network disconnects like this should be handled automatically by the library.
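For reference, the setup described above can be sketched roughly as follows. This is only an illustration, not the actual application code: the broker address and group id are placeholders, `poll_loop`, `handle`, and `should_stop` are hypothetical names, and mapping "polling every 1 sec" to a 1-second poll timeout is an assumption. The consumer object is expected to look like the confluent-kafka Python `Consumer` (a `poll(timeout)` method returning a message or `None`, with `error()` on the message).

```python
# Settings from the question, written as librdkafka configuration properties.
CONSUMER_CONFIG = {
    "bootstrap.servers": "broker.example:9093",  # placeholder, not from the question
    "group.id": "example-group",                 # placeholder, not from the question
    "session.timeout.ms": 10000,                 # session timeout: 10 seconds
    "heartbeat.interval.ms": 3000,               # heartbeat interval: 3 seconds
}


def poll_loop(consumer, handle, should_stop, poll_timeout_s=1.0):
    """Call poll() in a loop and leave reconnects to the client library.

    consumer    -- object with poll(timeout) returning a message or None
    handle      -- callback invoked for every message without an error
    should_stop -- zero-argument callable; the loop ends when it returns True
    """
    while not should_stop():
        msg = consumer.poll(poll_timeout_s)
        if msg is None:
            # Nothing arrived within the timeout; just poll again.
            continue
        err = msg.error()
        if err is not None:
            # Report the error and keep polling instead of exiting the loop.
            print("consumer error: {}".format(err))
            continue
        handle(msg)
```

With confluent-kafka installed, the loop would be driven by something like `consumer = Consumer(CONSUMER_CONFIG)` followed by `consumer.subscribe([...])` before calling `poll_loop`. The loop itself does no reconnect handling, which matches the expectation that the library takes care of it.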

I noticed that when the cluster goes down and comes back, the consumer is able to reconnect automatically. In this case, however, a local network issue kept the heartbeat request from going through, and the consumer got disconnected. The producers didn't have this issue: when the network issue was resolved about a minute later, they were able to produce to the cluster without any problem.

From the logs:

LOG-5-REQTMOUT: [thrd:GroupCoordinator]: GroupCoordinator/25: Timed out HeartbeatRequest in flight (after 10377ms, timeout #0)
LOG-4-REQTMOUT: [thrd:GroupCoordinator]: GroupCoordinator/25: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
ERROR (Local: Timed out): GroupCoordinator: 1 request(s) timed out: disconnect (after 3498718ms in state UP)
RebalanceCb: Local: Revoke partitions:
LOG-4-COMMITFAIL: [thrd:main]: Offset commit (unassign) failed for x/x partition(s): Local: Waiting for coordinator:
ERROR (Local: Broker transport failure): ssl://xxxx:p: Receive failed: SSL transport error: Connection timed out (after 3514121ms in state UP)
ERROR (Local: All broker connections are down): 11/11 brokers are down
LOG-4-REQTMOUT: [thrd:GroupCoordinator]: GroupCoordinator/25: Timed out 0 in-flight, 0 retry-queued, 2 out-queue, 0 partially-sent requests

Youli Luo

1 Answer

The issue was resolved after upgrading librdkafka to 1.3.0.

Youli Luo