
The rdkafka (C++) implementation of the Kafka consumer returns a single message per call to poll().

The Java implementation returns a batch of records, whose size is controlled in part by max.poll.records (as well as other configuration parameters).

There has been some commentary on GitHub about changing the rdkafka API to return multiple records; however, this has not been done, and there appears to be no plan to do so. From reading the thread, though, it became apparent that rdkafka buffers a number of records internally and releases them as individual records via successive calls to poll().

https://github.com/confluentinc/librdkafka/issues/1653

So it appears poll() in the rdkafka implementation works like this (a sketch follows the list):

  • poll() contacts the broker and fetches a batch of records (likely more than one)
  • These records are stored in an internal buffer
  • Successive calls to poll() do not contact the broker, but return a single record from the buffer
  • When the buffer is depleted, a new batch is fetched from the broker
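
For concreteness, here is a minimal sketch of that loop using the librdkafka C++ API (the broker address, group id, and topic name are placeholders, not values from the question):

    #include <iostream>
    #include <memory>
    #include <string>
    #include <librdkafka/rdkafkacpp.h>

    int main() {
      std::string errstr;

      // Global configuration; broker address and group id are placeholders.
      std::unique_ptr<RdKafka::Conf> conf(
          RdKafka::Conf::create(RdKafka::Conf::CONF_GLOBAL));
      conf->set("bootstrap.servers", "localhost:9092", errstr);
      conf->set("group.id", "example-group", errstr);

      std::unique_ptr<RdKafka::KafkaConsumer> consumer(
          RdKafka::KafkaConsumer::create(conf.get(), errstr));
      if (!consumer) {
        std::cerr << "failed to create consumer: " << errstr << "\n";
        return 1;
      }
      consumer->subscribe({"example-topic"});

      while (true) {
        // Each consume() call returns at most ONE message. Background
        // threads prefetch batches from the broker into an internal queue;
        // consume() merely pops the next message off that queue.
        std::unique_ptr<RdKafka::Message> msg(consumer->consume(1000 /*ms*/));
        if (msg->err() == RdKafka::ERR__TIMED_OUT)
          continue;  // queue currently empty, nothing fetched yet
        if (msg->err() != RdKafka::ERR_NO_ERROR) {
          std::cerr << "consume error: " << msg->errstr() << "\n";
          continue;
        }
        std::cout << "got offset " << msg->offset() << "\n";  // process here
      }
    }

One nuance relative to the list above: librdkafka's fetcher runs on background threads, so in practice new batches are requested before the local queue is fully depleted (tunable via queued.min.messages and related settings), but the one-message-per-poll() behaviour at the API surface is the same.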

The question I have is this: does this implementation not open up the possibility of losing records? Is it a hazardous implementation? Can it "lose" data if there are records remaining in the buffer and the process dies?

Rephrased another way: how does commit, and particularly auto commit, work with respect to this implementation? Auto commit is periodic and automatic. Is it possible to encounter a situation where auto commit commits an offset, and that offset is still awaiting processing in the consumer-side code? The consumer process then dies or crashes, and the committed offset is greater than the offset which was actually processed?
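
For what it's worth, the standard way to rule this scenario out is to disable auto commit and commit only after a record has been processed. A sketch, reusing the consumer setup above (process() is a hypothetical application-side handler, not part of the librdkafka API):

    // Disable auto commit so offsets are committed only explicitly,
    // after the application has finished processing a message.
    conf->set("enable.auto.commit", "false", errstr);

    // ... create consumer and subscribe as before, then:
    while (true) {
      std::unique_ptr<RdKafka::Message> msg(consumer->consume(1000));
      if (msg->err() != RdKafka::ERR_NO_ERROR)
        continue;

      process(msg.get());  // hypothetical handler, defined elsewhere

      // Synchronous commit of this message's offset. If the process dies
      // before this line, the record is simply redelivered after restart:
      // at-least-once delivery, but nothing is lost.
      consumer->commitSync(msg.get());
    }

This turns the failure mode into duplicate processing rather than data loss, which is the usual at-least-once trade-off.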

Also relevant: https://kafka.apache.org/26/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html

  • Won't add as an answer, but the description in this article suggests that the commit values are stored by poll(), and the offset value used to commit is the value from the *previous* call to poll(): https://www.learningjournal.guru/courses/kafka/kafka-foundation-training/offset-management/#:~:text=The%20offset%20is%20a%20simple,because%20of%20the%20current%20offset. – FreelanceConsultant Aug 20 '23 at 12:26
  • While this could result in records being processed twice, assuming this information is correct, the scenario I detailed in the question is not a concern. Records cannot get "lost" but might get processed more than once. – FreelanceConsultant Aug 20 '23 at 12:27
  • It might be that the information in the link is wrong, however. – FreelanceConsultant Aug 20 '23 at 12:27
  • librdkafka isn't designed to follow the Java spec exactly. As answered in the other question, use StreamConsumer so you don't need to repeatedly call poll yourself with the Rust client. – OneCricketeer Aug 20 '23 at 14:21

1 Answer


Can it "lose" data if there are records remaining in the buffer and the process dies?

Yes, but only if Kafka also loses the records due to retention policy. Otherwise, the next startup of the consumer will poll from the last available committed offset.
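
One detail worth adding: if the committed offset itself has already been removed by retention, where the consumer resumes is governed by the auto.offset.reset property (a topic-level setting that recent librdkafka versions also accept on the global config):

    // Fallback position when no valid committed offset exists for a
    // partition: "earliest" (start of log), "latest" (end), or "error".
    conf->set("auto.offset.reset", "earliest", errstr);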

possible to encounter a situation where auto commit commits an offset, and that offset is still awaiting processing in the consumer-side code?

Yes, but the Java API has the same problem, since it defaults to reading up to 500 records at once (max.poll.records) and will therefore auto commit in batches of the same size.
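
If you want to keep auto commit's periodic batching but never commit past what you have actually processed, librdkafka can decouple the two: disable enable.auto.offset.store and "store" each offset only after processing it; the periodic auto committer then commits at most the stored positions. A sketch along those lines (same hypothetical process() handler as before):

    // Keep auto commit, but control which offsets become eligible for it.
    conf->set("enable.auto.commit", "true", errstr);
    conf->set("enable.auto.offset.store", "false", errstr);

    while (true) {
      std::unique_ptr<RdKafka::Message> msg(consumer->consume(1000));
      if (msg->err() != RdKafka::ERR_NO_ERROR)
        continue;

      process(msg.get());  // hypothetical handler, defined elsewhere

      // Mark this message as processed; periodic auto commit will commit
      // at most up to here, so a crash causes redelivery, not loss.
      // Note: the stored offset is the NEXT offset to consume, hence +1.
      std::vector<RdKafka::TopicPartition *> offsets = {
          RdKafka::TopicPartition::create(
              msg->topic_name(), msg->partition(), msg->offset() + 1)};
      consumer->offsets_store(offsets);
      RdKafka::TopicPartition::destroy(offsets);
    }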
