
I observed my services going down with the exception below. The cause was that one of our three Kafka brokers was down, and Spring kept trying to connect to that same broker. Before it could skip the faulty broker and connect to the next available one, Kubernetes restarted the pod (due to a liveness probe failure configured at 60 seconds). Because of the restart, the next attempt also started with the same faulty broker, so the pod never came up.
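As a stopgap on the Kubernetes side, the liveness probe could be relaxed so the pod survives more than one 60-second metadata wait before being killed. This is only a sketch; the endpoint path, port, and timings are assumptions, not taken from our actual deployment:

```yaml
# Hypothetical liveness probe settings - all values are assumptions, tune to your deployment.
livenessProbe:
  httpGet:
    path: /actuator/health   # assumes the Spring Boot actuator health endpoint is exposed
    port: 8080
  initialDelaySeconds: 90    # allow more than one 60s metadata timeout before the first check
  periodSeconds: 30
  failureThreshold: 3        # tolerate ~90s of failed checks before restarting the pod
```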

How can we configure Spring to not wait more than 10 seconds on a faulty broker?

I found the `cloud.stream.binder.healthTimeout` property, but I am not sure it is the right one. Also, how can I replicate the issue locally? (A possible local setup is sketched after the log below.)
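A minimal sketch of the binder configuration I would try, assuming Spring Cloud Stream with the Kafka binder. `healthTimeout` only bounds the binder's health indicator check, so the client-level timeouts are passed through the binder's property maps; whether they also cap the provisioner's metadata wait is an assumption to verify:

```yaml
# application.yml - a sketch, not a verified fix; property names come from the
# Spring Cloud Stream Kafka binder and Kafka client docs, values are assumptions.
spring:
  cloud:
    stream:
      kafka:
        binder:
          brokers: broker1:9092,broker2:9092,broker3:9092
          healthTimeout: 10              # seconds the binder health indicator waits for partition info
          configuration:                 # applied to all Kafka clients created by the binder
            request.timeout.ms: 10000
          consumerProperties:
            default.api.timeout.ms: 10000   # consumer-side bound on blocking API calls
          producerProperties:
            max.block.ms: 10000             # producer-side bound on metadata/send blocking
```

Even with these in place, the first connection attempt may still go to the faulty broker; the point is to fail fast enough that the client can move on to the next address before the liveness probe fires.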

Kafka version: 2.2.1

{"timestamp":"2020-01-21T17:16:47.598Z","level":"ERROR","thread":"main","logger":"org.springframework.cloud.stream.binder.kafka.provisioning.KafkaTopicProvisioner","message":"Failed to obtain partition information","context":"default","exception":"org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.\n"}
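To replicate the issue locally, one option is to list an unreachable address as the first broker so the client always tries it first. This is only a sketch and the addresses are placeholders:

```yaml
# application.yml for local reproduction - addresses are assumptions/placeholders.
spring:
  cloud:
    stream:
      kafka:
        binder:
          # First entry points at a port where nothing listens, simulating a dead broker;
          # the second entry is a locally running broker (e.g. started via Docker).
          brokers: localhost:9999,localhost:9092
```

Note that a plain connection-refused failure is usually skipped quickly; a broker that hangs (for example, an address blackholed by a firewall DROP rule) reproduces the 60-second metadata wait more faithfully.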

  • How about changing the order of the bootstrap.server addresses? Also, if the broker is actually down, then that node would not be returned by the Controller – OneCricketeer Jan 22 '20 at 04:42
  • We changed the order for now. But that is not a permanent fix. In the future, some other broker might go down. The right approach will be to switch to the next available broker at runtime. – Neeraj Kukreti Jan 22 '20 at 05:27
  • You don't show a full stack trace (you should always show the full stack trace for questions like this) so I don't know whether you are provisioning a producer or consumer. Try reducing `default.api.timeout.ms` (consumer) and/or `max.block.ms` (producer). – Gary Russell Jan 22 '20 at 14:21
  • What version of Kafka are you running? How many replicas does your topic actually have? Is the producer trying to ack each event? – OneCricketeer Jan 22 '20 at 14:52
  • Kafka's version is 2.2.1. Each topic has 2 replicas. – Neeraj Kukreti Jan 31 '20 at 03:05

0 Answers