
On Kafka Streams (version 2.3.1), we are facing issues with committing offsets:

org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before successfully committing offsets

This is not happening regularly. What could be the reason?

Also, since TimeoutException is a retriable exception, we were planning to increase retries in case this is just an intermittent error.

Will that help in any way? We have the at-least-once processing guarantee, and ordering matters in our use case.

We hope it won't affect offset ordering in any way, since Kafka Streams commits offsets synchronously: if one commit fails and is being retried, that particular stream thread will not process new records, so the offset order should not be disturbed.
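For context, this is a minimal sketch of the kind of configuration change we are considering (property names are from StreamsConfig; the application id, bootstrap servers, and retry values below are placeholders, not our real settings):

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class RetryConfigSketch {
        static Properties streamsProps() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            // We already run with at-least-once (the default); shown here only for clarity.
            props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.AT_LEAST_ONCE);
            // Retry settings we are thinking of raising (retries defaults to 0 in Streams).
            // The values below are illustrative only.
            props.put(StreamsConfig.RETRIES_CONFIG, 10);
            props.put(StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 1000L);
            return props;
        }
    }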


1 Answer


If the TimeoutException is transient, it could stem from intermittent network issues or from some brokers being overloaded; in that case, increasing the retries may help.

However, it might be better in terms of latency to find the root cause of the timeout. In order to find it, you might want to look into the metrics more closely. Here is a blog post that will get you started: https://sematext.com/blog/kafka-metrics-to-monitor/
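For a quick look without wiring up JMX dashboards, you can also read the same metrics off the running KafkaStreams instance directly. This is only a sketch (the class and method names are made up for illustration); it filters for the per-thread commit-latency metrics, which are the first thing I would check for this particular timeout:

    import java.util.Map;
    import org.apache.kafka.common.Metric;
    import org.apache.kafka.common.MetricName;
    import org.apache.kafka.streams.KafkaStreams;

    public class CommitMetricsLogger {
        // Print the commit-latency metrics that each stream thread reports.
        static void logCommitMetrics(KafkaStreams streams) {
            for (Map.Entry<MetricName, ? extends Metric> entry : streams.metrics().entrySet()) {
                MetricName name = entry.getKey();
                if (name.name().startsWith("commit-latency")) {
                    System.out.printf("%s [%s] = %s%n",
                            name.name(), name.group(), entry.getValue().metricValue());
                }
            }
        }
    }

You could call something like this periodically from a background thread, or just rely on the JMX / monitoring setup described in the blog post above.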

It might be fixable by giving the bottleneck more resources.

wcarlson
  • Thanks, the error is transient, and it makes sense to monitor the metrics and address the bottleneck with more resources. But I am curious whether Confluent Cloud's managed Kafka has any solution that addresses this by listening to JMX metrics and auto-scaling. – optimus Aug 27 '20 at 09:22
  • The Confluent managed cloud should abstract away the idea of brokers and auto-scale completely for you; sadly, I don't know many details about that. There is also KIP-572 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-572%3A+Improve+timeouts+and+retries+in+Kafka+Streams), which is currently being worked on and might help with your problem. It should be in the next release, 2.7. – wcarlson Aug 27 '20 at 18:38