
In a Quarkus process, we perform the following steps once a message is polled from Kafka (sketched in code after the list):

  1. Thread.sleep(30000) - Due to business logic
  2. call a 3rd party API
  3. call another 3rd party api
  4. Inserting data in db
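Sketched out, the handler body does roughly the following (the client and repository names are placeholders for our real beans, not the actual code):

// firstPartyClient, secondPartyClient and personRepository are placeholder names
// for the injected beans we actually use.
void process(Person person) throws Exception {
    Thread.sleep(30000);                                     // 1. business-mandated wait
    FirstResponse first = firstPartyClient.fetch(person);    // 2. first 3rd party API call
    SecondResponse second = secondPartyClient.fetch(first);  // 3. second 3rd party API call
    personRepository.insert(person, second);                 // 4. insert the result into the DB
}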

Almost every day, the process hangs once after throwing a TooManyMessagesWithoutAckException.

2022-12-02 20:02:50 INFO  [2bdf7fc8-e0ad-4bcb-87b8-c577eb506b38,     ] : Going to sleep for 30 sec.....
2022-12-02 20:03:20 WARN  [                    kafka] : SRMSG18231: The record 17632 from topic-partition '<partition>' has waited for 60 seconds to be acknowledged. This waiting time is greater than the configured threshold (60000 ms). At the moment 2 messages from this partition are awaiting acknowledgement. The last committed offset for this partition was 17631. This error is due to a potential issue in the application which does not acknowledged the records in a timely fashion. The connector cannot commit as a record processing has not completed.
2022-12-02 20:03:20 WARN  [                     kafka] : SRMSG18228: A failure has been reported for Kafka topics '[<topic name>]': io.smallrye.reactive.messaging.kafka.commit.KafkaThrottledLatestProcessedCommit$TooManyMessagesWithoutAckException: The record 17632 from topic/partition '<partition>' has waited for 60 seconds to be acknowledged. At the moment 2 messages from this partition are awaiting acknowledgement. The last committed offset for this partition was 17631.
2022-12-02 20:03:20 INFO  [2bdf7fc8-e0ad-4bcb-87b8-c577eb506b38,     ] : Sleep over!

Below is an example of how we consume the messages:

@Incoming("my-channel")
@Blocking
CompletionStage<Void> consume(Message<Person> person) {
     String msgKey = (String) person
        .getMetadata(IncomingKafkaRecordMetadata.class).get()
        .getKey();
        // ...
      return person.ack();
}

According to the logs, only 30 seconds had passed since the event was polled, yet the exception says the Kafka acknowledgement was not sent for 60 seconds. I checked the whole day's logs around the time the error was thrown to see whether the REST API calls took more than 30 seconds to return data, but I wasn't able to find any.

We haven't done any specific Kafka configuration other than the topic name, channel name, serializer, deserializer, group id, and managed Kafka connection details.

There are 4 partitions in this topic with a replication factor of 3. There are 3 pods running for this process. We're unable to reproduce this issue in the Dev and UAT environments.

I checked the configuration options but couldn't find any setting that might help: Quarkus Kafka Reference

mp:
  messaging:
    incoming:
      my-channel:
        topic: <topic>
        group:
          id: <group id>
        connector: smallrye-kafka
        value:
          serializer: org.apache.kafka.common.serialization.StringSerializer
          deserializer: org.apache.kafka.common.serialization.StringDeserializer

Is it possible that Quarkus is acknowledging the messages in batches, and that by then the waiting time has already reached the threshold? Please comment if there are any other possible causes of this issue.

  • Kafka requires you to regularly poll in a certain time. You shouldn't sleep that thread. Rather `pause()` the consumer if you need to do a lot of work... This is not a problem unique to Quarkus. Otherwise, produce to topics to call APIs and consume the responses in a chain of topics, rather than try to call multiple APIs and write to a database all in one action (and/or use Kafka Connect to actually do the database work) – OneCricketeer Dec 07 '22 at 22:48

1 Answer


I have similar issues on our production environment, running different Quarkus services against a simple 3-node Kafka cluster, and I have researched the problem a lot, with no clear answer. At the moment, I have two approaches to this problem:

  1. Make sure you really ack or nack the Kafka message in your code. Is every exception really caught and answered with a "person.nack(exception);" (or a "person.ack();", depending on your failure strategy)? Make sure it is. The Throttled exception is thrown if neither ack() nor nack() is performed; the problem mostly occurs when nothing happens at all. See the sketch after this list.
  2. When that does not help, I switch the commit strategy to "latest": mp.messaging.incoming.my-channel.commit-strategy=latest This is a little slower, because batch commits are disabled, but it runs stably in my case. If you don't know about commit strategies and the default, catch up with the good article by Escoffier.
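For the first point, here is a minimal sketch of what I mean (process() is a placeholder for your own sleep/API/DB logic):

@Incoming("my-channel")
@Blocking
CompletionStage<Void> consume(Message<Person> person) {
    try {
        process(person.getPayload()); // placeholder for the business logic
        return person.ack();          // positive acknowledgement
    } catch (Exception e) {
        // without this branch an exception can leave the record neither acked nor nacked
        return person.nack(e);        // negative acknowledgement, handled by the failure strategy
    }
}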

I am aware that this does not solve the root cause, but it helped in desperate times. The problem has to be that one or more queued messages are not acknowledged in time, but I can't tell you why. Maybe the application logic is too slow, but like you I have a hard time reproducing this locally. You can also try to increase the 60-second threshold with throttled.unprocessed-record-max-age.ms and see for yourself whether this helps. In my case, it did not. Maybe someone else can share their insights into this problem and provide you with a real solution.
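For reference, both settings are per channel; in application.properties they would look roughly like this (the 120000 ms value is just an example):

# default commit strategy is "throttled"; "latest" commits each processed offset
mp.messaging.incoming.my-channel.commit-strategy=latest

# or, keeping "throttled", raise the 60 s threshold
mp.messaging.incoming.my-channel.throttled.unprocessed-record-max-age.ms=120000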