This question is similar to https://stackoverflow.com/a/54637004 but I am not 100% sure that it is the exact same situation - it presents slightly differently at least.
I noticed that a topic I am writing to using Kafka Streams, as in the above question, does not have contiguous offsets in any given partition. This is expected when using exactly-once semantics, but the "gaps" are a little confusing to me.
The behaviour I'm seeing is the following:
$ kcat -C -b <broker> -t <topic> -o beginning -f '%o\n' -p 0
0
1
2
3
4
5
7 <-- Gap in offsets after 6 records
8
...
13
14
16 <-- Gap in offsets after 8 records
17
...
24
25
27 <-- Gap in offsets after 10 records
...
(note that kcat uses the librdkafka default, READ_COMMITTED, when consuming)
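To make the pattern explicit, the gaps can be tallied with a small script. This is only a sketch: the helper name `find_gaps` is made up for illustration, and the `observed` list assumes that the elided "..." ranges in the output above are contiguous, which the output itself does not show.

```python
from typing import Iterable, List, Tuple


def find_gaps(offsets: Iterable[int]) -> List[Tuple[int, int]]:
    """Return (first_missing_offset, records_since_previous_gap) per gap."""
    gaps = []
    run_length = 0  # records seen since the last gap (or since the start)
    prev = None
    for offset in offsets:
        if prev is not None and offset != prev + 1:
            gaps.append((prev + 1, run_length))
            run_length = 0
        run_length += 1
        prev = offset
    return gaps


# Offsets as printed by kcat, ASSUMING the elided "..." ranges are contiguous.
observed = list(range(0, 6)) + list(range(7, 15)) + list(range(16, 26)) + [27]

for missing, run in find_gaps(observed):
    print(f"offset {missing} missing after {run} records")
```

Under that assumption, every gap is exactly one offset wide; what varies is the number of records between gaps.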
As opposed to the picture in the above question, the "commit marker offset" (i.e. the gap) is at very different locations.
I tried to check whether there might be any aborted messages in the topic, but did not find any when reading with READ_UNCOMMITTED isolation level, which leads me to believe that the gaps are indeed due to the commit marker.
Am I understanding correctly that the differing number of records between gaps is due to Kafka Streams producing different numbers of messages in a given transaction batch (through e.g. max.message.bytes)?