
I am using Kafka Streams in a critical application, and I am facing an issue where transactions expire on idle stream threads. This causes problems after a rebalance, when a task shifts to a previously idle thread whose producer has already expired. The problem doesn't become apparent until that producer tries to send for the first time, at which point it throws a ProducerFencedException and the stream shuts down. We then need to recycle the application to get it processing again, which isn't acceptable.
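
For completeness, here is a minimal sketch of how we could at least detect the dead state programmatically (class and method names are placeholders; in the 1.1.0 client the uncaught exception handler can only observe the failure, it cannot restart the thread):

```java
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.Topology;

public class StreamsLifecycleSketch {

    // Assumes `topology` and `props` are the application's existing topology and config.
    public static KafkaStreams startWithMonitoring(Topology topology, Properties props) {
        KafkaStreams streams = new KafkaStreams(topology, props);

        // Called when any stream thread dies with an unhandled exception
        // (this is where the ProducerFencedException ultimately surfaces).
        streams.setUncaughtExceptionHandler((thread, throwable) ->
                System.err.println("Stream thread " + thread.getName() + " died: " + throwable));

        // Called on state transitions; ERROR means the instance can no longer recover on its own.
        streams.setStateListener((newState, oldState) -> {
            if (newState == KafkaStreams.State.ERROR) {
                // e.g. flip a health indicator here so the instance gets recycled automatically
                System.err.println("Kafka Streams entered ERROR state (was " + oldState + ")");
            }
        });

        streams.start();
        return streams;
    }
}
```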

Here is the application setup:

  • Single topic with 2 partitions
  • 4 instances of the Spring Boot application running, with 2 stream threads per application instance. The reason for the additional instances is that this is a critical application: we have to allow for 2 instances potentially being down for server patching and still have resiliency by having multiple instances of the application running (i.e. 2). Each application instance is capable of handling the full load on its own within SLAs
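
For reference, a minimal sketch of the Streams configuration behind this setup (the application id and bootstrap servers are placeholders; exactly-once processing is shown because that is what puts the transactional producers in play):

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

class StreamsPropsSketch {
    // Illustrative only: names and addresses are placeholders, not the production values.
    static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "critical-app");    // shared by all 4 instances
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);             // 2 stream threads per instance
        // Exactly-once processing is what makes the embedded producers transactional:
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        return props;
    }
}
```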

I'd appreciate any insights anyone has into how we can set up our Kafka Streams application or Kafka cluster so that transactions don't expire with this setup.
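
For context, these are the settings I understand to be involved in the expiry (values below are placeholders, not recommendations): the broker-side `transactional.id.expiration.ms` controls how long the broker remembers an idle transactional id, `transaction.max.timeout.ms` caps the producer transaction timeout, and the producer-side `transaction.timeout.ms` can be forwarded to the embedded producers via the producer prefix:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

class TransactionExpirySketch {

    // Broker side (server.properties), not settable from the Streams client -- placeholder values:
    //   transactional.id.expiration.ms=1209600000   # how long the broker retains an idle transactional id
    //   transaction.max.timeout.ms=900000           # upper bound for producer transaction timeouts

    // Client side: how long an open transaction may run before the broker aborts it.
    // Forwarded to the embedded producers through the "producer." prefix. Placeholder value.
    static void addTransactionTimeout(Properties streamsProps) {
        streamsProps.put(
                StreamsConfig.producerPrefix(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG),
                120_000);
    }
}
```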

Relevant Versions: Kafka cluster version: 1.0.0, Kafka client version: 1.1.0, Spring Boot version: 2.0.0

  • Kafka Streams should not shut down if a `ProducerFencedException` occurs, but should try to rebalance and re-join the consumer group. Maybe there is a bug in the `1.1.0` release -- can you try a newer release? – Matthias J. Sax Oct 30 '19 at 04:44
  • Maybe. Is there a version that you can recommend? – user10418702 Oct 30 '19 at 12:41
  • Also, the flow of messages we see is: first, an ERROR for ProducerFencedException. Then there is a WARN that the stream task got migrated to another thread already and is being closed as a zombie. Following on from that we get WARN: Detected a task that got migrated to another thread. This implies that this thread missed a rebalance and dropped out of the consumer group. Trying to rejoin the consumer group now. Then another error, java.lang.IllegalStateException: Record's partition does not belong to this partition-group, and then it starts shutting down – user10418702 Oct 30 '19 at 12:49
  • Latest version is 2.3.1, and 2.4.0 should be released in the next weeks... So you are using a relatively old release (1.1.0 was released in March 2018...) -- newer releases are in general always better than older releases. --- About the error flow: the `ProducerFencedException` is not the reason why the thread dies, but the `IllegalStateException` is. The error message you point out is the same as in this ticket, which is fixed in the 2.0.0 release: https://issues.apache.org/jira/browse/KAFKA-6534 – Matthias J. Sax Oct 31 '19 at 05:42
  • After doing some analysis we are updating to version 2.0.1 as there are 2 more fixes in there related to ProducerFencedException. So thanks for that. – user10418702 Nov 05 '19 at 20:54
  • @MatthiasJ.Sax a similar issue occurred in the same service. It doesn't follow the same log flow as above, however. Instead we just got this WARN log: "Detected a task that got migrated to another thread. This implies that this thread missed a rebalance and dropped out of the consumer group. Trying to rejoin the consumer group now. org.apache.kafka.streams.errors.TaskMigratedException..." This was the last log on the thread, i.e. it stopped processing without any logs to indicate this, whereas in the previous example there were subsequent logs during the shutdown. Any thoughts on this one? – user10418702 Nov 06 '19 at 13:15
  • Not sure. A `TaskMigratedException` should not kill the thread and the thread should just rejoin the consumer group and continue processing. – Matthias J. Sax Nov 07 '19 at 03:23
  • Thanks. This does seem to be catered for now as well since we upgraded the library. – user10418702 Nov 08 '19 at 14:10

0 Answers