My services were going down with the exception below. The cause was that one of our three Kafka brokers was down, and Spring kept trying to connect to that same broker. Before it could skip the faulty broker and connect to the next available one, Kubernetes restarted the pod (due to a liveness probe failure; the probe is configured at 60 seconds). After the restart, it again tries the same faulty broker first, so the pod never comes up.
How can we configure Spring to wait no more than 10 seconds for a faulty broker?
I found the cloud.stream.binder.healthTimeout property, but I am not sure if this is the right one. Also, how can I replicate the issue locally?
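For reference, here is a rough sketch of the settings I am considering, not a confirmed fix. It assumes that the full name of the health property is spring.cloud.stream.kafka.binder.healthTimeout (value in seconds), and that the 60000 ms in the exception comes from the Kafka producer's max.block.ms setting (default 60000 ms), passed to the client through the binder's configuration map. Both are assumptions on my part.

```properties
# Assumption: full name of the health timeout property; value is in seconds
spring.cloud.stream.kafka.binder.healthTimeout=10

# Assumption: the "60000 ms" in the log is the Kafka producer's max.block.ms
# (default 60000 ms); the binder's "configuration" map forwards it to the client
spring.cloud.stream.kafka.binder.configuration.max.block.ms=10000
```

If the second assumption holds, the metadata lookup should fail after about 10 seconds instead of 60, which is below the liveness probe window.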
Kafka version: 2.2.1
{"timestamp":"2020-01-21T17:16:47.598Z","level":"ERROR","thread":"main","logger":"org.springframework.cloud.stream.binder.kafka.provisioning.KafkaTopicProvisioner","message":"Failed to obtain partition information","context":"default","exception":"org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.\n"}