
Occasionally, my Kafka Streams application dies with the following error:

[-StreamThread-4] o.a.k.s.p.i.AssignedStreamsTasks : Failed to commit stream task 0_9 due to the
following error:
org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before successfully
committing offsets {my-topic-9=OffsetAndMetadata{offset=5840887122, leaderEpoch=null, metadata=''}}

From the docs I assume the 60000ms originates from the default.api.timeout.ms property. So I could probably just increase this timeout. But what other options do I have?
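For context, I assume raising that timeout would look roughly like the sketch below; the use of the consumer prefix and the 120-second value are just my reading of the docs, not something I have verified:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class TimeoutConfigSketch {
    public static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");          // placeholder application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");  // placeholder broker

        // Assumption: the 60000ms comes from the consumer's default.api.timeout.ms.
        // Kafka Streams forwards consumer-level settings via the "consumer." prefix helper.
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.DEFAULT_API_TIMEOUT_MS_CONFIG), 120_000);
        return props;
    }
}
```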

My application runs with processing-guarantee: exactly_once, and for that I found the following in the documentation:

commit.interval.ms: The frequency with which to save the position of the processor. (Note, if processing.guarantee is set to exactly_once, the default value is 100, otherwise the default value is 30000.)

So the commit interval is quite low in my case. Why does it have to be so low for exactly_once? Could I increase the interval to reduce the number of commits and thereby ease the situation?
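If I were to raise it, I assume the override would look roughly like this; the 10-second value is an arbitrary example, not a recommendation:

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class CommitIntervalSketch {
    public static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");          // placeholder application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");  // placeholder broker

        // exactly_once drops the default commit.interval.ms from 30000 to 100 (see the docs quote above)
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        // Explicit override; 10 seconds is an arbitrary example value
        props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10_000);
        return props;
    }
}
```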

What other options do I have?

D-rk

1 Answer


Increasing the timeout is certainly an option. There is actually work in progress to make Kafka Streams more resilient to timeout exceptions: https://cwiki.apache.org/confluence/display/KAFKA/KIP-572%3A+Improve+timeouts+and+retries+in+Kafka+Streams
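As a sketch of what KIP-572 adds: as far as I know it introduces a task.timeout.ms config (around Apache Kafka 2.8, so not yet available at the time of writing) that bounds how long a task keeps retrying on timeout exceptions before the error is raised. The value below is an arbitrary illustrative assumption:

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class Kip572TimeoutSketch {
    public static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");          // placeholder application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");  // placeholder broker

        // "task.timeout.ms" comes from KIP-572 (newer releases only); 10 minutes is an
        // arbitrary example of a retry window, not a recommended value.
        props.put("task.timeout.ms", 600_000);
        return props;
    }
}
```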

About commit.interval.ms: it is set low to keep the end-to-end latency of your application low. As long as a transaction is pending, downstream consumers (in "read_committed" mode) cannot consume the data and thus experience additional latency until the transaction is committed. For Kafka Streams applications with potentially multiple repartition steps, it's essential to commit frequently to keep latency low.

Hence, depending on your latency requirements you may or may not be able to increase the commit interval.
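As an illustration of the read_committed behavior described above, here is a minimal sketch of such a downstream consumer; the topic name, group id, and broker address are placeholders:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReadCommittedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");  // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "downstream-app");        // placeholder group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        // Records of an open transaction are withheld until commit, so a longer
        // commit.interval.ms upstream directly adds to the latency observed here.
        props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic"));  // placeholder topic
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```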

Matthias J. Sax
  • But when my problem is that the broker sometimes cannot commit offsets within a minute, this also affects the latency of my application, just orders of magnitude worse. Wouldn't it be reasonable to double the commit interval in any case? Another option would be to enlarge the cluster to handle the load better. – D-rk Apr 04 '20 at 09:19
  • 1
    Hard to say. The question is really what the root cause is why the commit fails. If you understand the root cause, you can react accordingly. – Matthias J. Sax Apr 05 '20 at 21:13
  • Hi, I'm experiencing this problem in the following situation. Topology: mostly DSL API, some PAPI. Around 30 topics incl. some repartitions. Kafka: 3 brokers. Kafka Streams: launched a large number of members in the group (96 members) to try to reduce a millions-to-hundreds-of-millions-of-records lag that got introduced for other reasons. @MatthiasJ.Sax could you state a number of reasons why this could happen? Any monitoring to check especially? Thanks! – xmar Jun 08 '22 at 09:53
  • Given that you have only 3 brokers but 96 consumers (with very frequent commits), could it be that you overload the group coordinator? -- There are some consumer metrics on commit latency (cf. https://kafka.apache.org/documentation/#consumer_group_monitoring) -- Kafka Streams also exposes metrics about it (thread and task granularity: https://kafka.apache.org/documentation/#kafka_streams_thread_monitoring) -- Hope this helps. – Matthias J. Sax Jun 08 '22 at 20:05
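As an illustration of checking those metrics programmatically, here is a minimal sketch that dumps the commit-latency metrics from a running KafkaStreams instance; the exact metric names can vary by version, so treat them as an assumption based on the linked monitoring docs:

```java
import java.util.Map;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.streams.KafkaStreams;

public class CommitLatencyDump {
    // Prints any metric whose name starts with "commit-latency" (e.g. commit-latency-avg,
    // commit-latency-max at thread granularity), assuming those names for your version.
    public static void printCommitLatency(KafkaStreams streams) {
        for (Map.Entry<MetricName, ? extends Metric> entry : streams.metrics().entrySet()) {
            MetricName name = entry.getKey();
            if (name.name().startsWith("commit-latency")) {
                System.out.printf("%s (%s) = %s%n", name.name(), name.group(), entry.getValue().metricValue());
            }
        }
    }
}
```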