In a Quarkus process, we perform the steps below once a message is polled from Kafka (a rough sketch of this flow follows the list):
- Thread.sleep(30000) - due to business logic
- Call a third-party API
- Call another third-party API
- Insert data into the database
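Roughly, the body of the handler looks like the sketch below. This is only an illustration, not our actual code: the URLs, class name, and persist step are placeholders, and it assumes plain java.net.http calls for the two third-party APIs.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative sketch only; URLs and the persist step are placeholders.
public class PersonProcessor {

    private final HttpClient httpClient = HttpClient.newHttpClient();

    void process(String personKey) throws Exception {
        Thread.sleep(30_000); // delay required by business logic

        // first third-party API call
        String first = call("https://third-party-a.example.com/data/" + personKey);

        // second third-party API call
        String second = call("https://third-party-b.example.com/enrich/" + personKey);

        // insert the combined result into the DB
        persist(personKey, first, second);
    }

    private String call(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return httpClient.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    private void persist(String key, String first, String second) {
        // placeholder for the actual DB insert
    }
}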
Almost every day, the process hangs once after throwing a TooManyMessagesWithoutAckException.
2022-12-02 20:02:50 INFO [2bdf7fc8-e0ad-4bcb-87b8-c577eb506b38, ] : Going to sleep for 30 sec.....
2022-12-02 20:03:20 WARN [ kafka] : SRMSG18231: The record 17632 from topic-partition '<partition>' has waited for 60 seconds to be acknowledged. This waiting time is greater than the configured threshold (60000 ms). At the moment 2 messages from this partition are awaiting acknowledgement. The last committed offset for this partition was 17631. This error is due to a potential issue in the application which does not acknowledged the records in a timely fashion. The connector cannot commit as a record processing has not completed.
2022-12-02 20:03:20 WARN [ kafka] : SRMSG18228: A failure has been reported for Kafka topics '[<topic name>]': io.smallrye.reactive.messaging.kafka.commit.KafkaThrottledLatestProcessedCommit$TooManyMessagesWithoutAckException: The record 17632 from topic/partition '<partition>' has waited for 60 seconds to be acknowledged. At the moment 2 messages from this partition are awaiting acknowledgement. The last committed offset for this partition was 17631.
2022-12-02 20:03:20 INFO [2bdf7fc8-e0ad-4bcb-87b8-c577eb506b38, ] : Sleep over!
Below is an example of how we consume the messages:
@Incoming("my-channel")
@Blocking
CompletionStage<Void> consume(Message<Person> person) {
    String msgKey = (String) person
            .getMetadata(IncomingKafkaRecordMetadata.class).get()
            .getKey();
    // ...
    return person.ack();
}
As per the logs, only 30 seconds had passed since the event was polled, yet the exception says the Kafka acknowledgement was not sent for 60 seconds. I checked the whole day's logs from when the error was thrown to see whether the REST API calls took more than 30 seconds to fetch the data, but I wasn't able to find any that did.
We haven't done any specific Kafka configuration other than the topic name, channel name, serializer, deserializer, group id, and managed Kafka connection details.
The topic has 4 partitions with a replication factor of 3, and there are 3 pods running this process. We're unable to reproduce this issue in the Dev and UAT environments.
I went through the configuration options (Quarkus Kafka Reference) but couldn't find any setting that might help:
mp:
  messaging:
    incoming:
      my-channel:
        topic: <topic>
        group:
          id: <group id>
        connector: smallrye-kafka
        value:
          serializer: org.apache.kafka.common.serialization.StringSerializer
          deserializer: org.apache.kafka.common.serialization.StringDeserializer
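The closest attribute I could find for the connector is throttled.unprocessed-record-max-age.ms, whose 60000 ms default matches the threshold in the warning. For illustration only, raising it would look like the snippet below, but I assume that would just postpone the symptom rather than fix whatever is stalling the processing:
mp:
  messaging:
    incoming:
      my-channel:
        throttled:
          unprocessed-record-max-age:
            ms: 120000
There is also a commit-strategy attribute (throttled appears to be the default with a group id set), but switching strategies feels like working around the problem rather than understanding it.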
Is it possible that Quarkus is acknowledging the messages in batches, and that by the time it does, the waiting time has already reached the threshold? Please comment if there are any other possibilities for this issue.
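If batching per poll is indeed the culprit, one experiment that comes to mind is capping how many records are fetched per poll. This assumes, as the SmallRye documentation suggests, that unrecognised channel attributes such as max.poll.records are passed through to the underlying Kafka consumer; I'm also not sure whether the connector's internal buffering would make this moot:
mp:
  messaging:
    incoming:
      my-channel:
        max:
          poll:
            records: 10
With a roughly 30-second blocking handler per record, even two records received together would leave the second one unacknowledged for more than 60 seconds, which would be consistent with the "At the moment 2 messages from this partition are awaiting acknowledgement" part of the warning.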