The rdkafka (C++) implementation of the Kafka consumer returns a single message per call to poll().
The Java implementation returns a batch of records, whose size is controlled in part by max.poll.records (as well as other configuration parameters).
There was some commentary on GitHub about changing the rdkafka API to return multiple records; however, this has not been done, and there appears to be no plan to do so. From reading the thread, though, it became apparent that rdkafka buffers a number of records internally and releases them as individual records via successive calls to poll().
https://github.com/confluentinc/librdkafka/issues/1653
So it appears poll() in the rdkafka implementation works like this (a rough consume loop illustrating this is sketched after the list):

- poll() contacts the broker and obtains a number of records (likely more than one)
- These records are stored in a buffer somewhere
- Successive calls to poll() do not contact the broker, but return a single record from the buffer
- When all records in the buffer are depleted, a new batch is obtained from the broker
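For concreteness, here is a minimal sketch of such a loop against librdkafka's C++ API (where the per-message call is KafkaConsumer::consume(); the C API's rd_kafka_consumer_poll() behaves the same way). The broker address, group id, and topic name are placeholders, and error/shutdown handling is omitted:

    #include <iostream>
    #include <string>
    #include <vector>
    #include <librdkafka/rdkafkacpp.h>

    int main() {
        std::string errstr;

        // Global configuration; broker and topic are placeholders.
        RdKafka::Conf *conf = RdKafka::Conf::create(RdKafka::Conf::CONF_GLOBAL);
        conf->set("bootstrap.servers", "localhost:9092", errstr);
        conf->set("group.id", "example-group", errstr);

        RdKafka::KafkaConsumer *consumer =
            RdKafka::KafkaConsumer::create(conf, errstr);
        delete conf;  // the consumer keeps its own copy of the config

        std::vector<std::string> topics = {"my-topic"};
        consumer->subscribe(topics);

        while (true) {
            // Each call returns exactly one message (or an error such as a
            // timeout), served from librdkafka's internal fetch buffer; the
            // broker is only contacted when that buffer needs refilling.
            RdKafka::Message *msg = consumer->consume(1000 /* timeout ms */);
            if (msg->err() == RdKafka::ERR_NO_ERROR) {
                std::cout << "offset " << msg->offset() << ": "
                          << std::string(
                                 static_cast<const char *>(msg->payload()),
                                 msg->len())
                          << "\n";
            }
            delete msg;
        }

        consumer->close();
        delete consumer;
        return 0;
    }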
The question I have is this: does this implementation not open up the possibility of losing records? Is it a hazardous implementation? Can it lose data if there are records remaining in the buffer and the process dies?
Re-phrased another way: how does commit, and particularly auto commit, work with respect to this implementation? Auto commit is periodic and automatic. Is it possible to encounter a situation where auto commit commits an offset that is still awaiting processing in the consumer-side code? The consumer process then dies or crashes, and the committed offset is greater than the offset that was actually processed?
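For what it's worth, the usual way to rule out this failure mode is to disable auto commit and commit each offset only after the message has been processed, trading the hazard for possible re-delivery (at-least-once semantics). A sketch, assuming the same setup as above; process() is a hypothetical application-level handler:

    // Same setup as the sketch above, but with auto commit disabled
    // before creating the consumer:
    //   conf->set("enable.auto.commit", "false", errstr);

    while (true) {
        RdKafka::Message *msg = consumer->consume(1000);
        if (msg->err() == RdKafka::ERR_NO_ERROR) {
            process(msg);  // hypothetical application-level handler

            // Synchronously commit this message's offset only after it
            // has been processed, so a crash can never leave the
            // committed offset ahead of the processed position.
            consumer->commitSync(msg);
        }
        delete msg;
    }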
Also relevant: https://kafka.apache.org/26/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html