This question is similar to https://stackoverflow.com/a/54637004 but I am not 100% sure that it is the exact same situation - it presents slightly differently at least.

I noticed that a topic that I am writing to using Kafka Streams, as in the above question, does not have a continuous value for the offsets in any given partition. This is expected when using exactly-once semantics, but the "gaps" seem a little confusing to me.

The behaviour I'm seeing is the following:

$ kcat -C -b <broker> -t <topic> -o beginning -f '%o\n' -p 0
0
1
2
3
4
5
7  <-- Gap in offsets after 6 records
8
...
13
14
16 <-- Gap in offsets after 8 records
17
...
24
25
27 <-- Gap in offsets after 10 records
...

(note that kcat uses the librdkafka default, READ_COMMITTED, when consuming)

In contrast to the picture in the linked question, the "commit marker offset" (i.e. the gap) appears at varying positions.

I tried to check whether there might be any aborted messages in the topic, but did not find any when reading with the READ_UNCOMMITTED isolation level, which leads me to believe that the gaps are indeed due to commit markers.

Am I understanding correctly that the different gap sizes between offsets are due to Kafka Streams producing a different number of messages in each transaction batch (influenced by e.g. max.message.bytes)?
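To illustrate what I mean: if every committed transaction of N records is followed by exactly one control record (the commit marker) that occupies an offset but is never delivered to consumers, then batches of 6, 8, and 10 records would reproduce exactly the offsets shown above. A minimal sketch of that arithmetic (the batch sizes are just the ones inferred from my listing, not something I have confirmed from the Streams side):

```python
def visible_offsets(batch_sizes):
    """Offsets a READ_COMMITTED consumer would see, assuming each committed
    transaction is followed by one commit marker occupying one offset."""
    offsets = []
    next_offset = 0
    for n in batch_sizes:
        # the n data records of this transaction
        offsets.extend(range(next_offset, next_offset + n))
        # skip one extra offset for the commit marker
        next_offset += n + 1
    return offsets

print(visible_offsets([6, 8, 10]))
```

This yields 0-5, 7-14, and 16-25, matching the kcat output above.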

  • kcat skips transaction markers, I believe. How large are your transaction boundaries? – OneCricketeer Sep 10 '22 at 01:48
  • I tested consumption using `READ_UNCOMMITTED` with a Spring Kafka based consumer in addition to `kcat`, both results were the same. I'm not sure what exactly *transaction boundaries* refers to. Are you referring to how large the offset jumps are? If so, they always seem to just be 1 (i.e. `5 -> 7`, `14 -> 15`, `25 -> 27`, ...). – filpa Sep 12 '22 at 08:05
  • [This post describes transactions](https://www.confluent.io/blog/transactions-apache-kafka/) and markers. They are enabled by default in latest Kafka clients. By boundary, I mean, how many records are in each transaction? I don't imagine Kafka Streams puts in more than one record, so that's why you'd see "every other offset" for your own data, separate from the transaction marker offsets in the same topic – OneCricketeer Sep 12 '22 at 13:10
  • Thanks for the link - very informative. *how many records are in each transaction* Sorry for misinterpreting. Unless I'm completely misinterpreting it, it seems that the number of records in each transaction for that topic-partition (at least, going by the above offsets) is different every time (first `6`, then `8`, then `10`, ...), which is why I thought this might have to do with batching different numbers of messages in a single TX. – filpa Sep 12 '22 at 14:01
  • But I'm unsure as to how I would verify this. I can't say for sure that record offsets `0` to `5` were produced in the same transaction, I'm not sure how I would even check that barring `DEBUG` logs (which I will do if there's no better way to tell). Basically, I wish to understand this gap and make sure that I don't misinterpret it. I hope I've communicated the question well. – filpa Sep 12 '22 at 14:03
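For what it's worth, one way to sanity-check the grouping without DEBUG logs is to invert the reasoning: under the assumption that every single-offset gap seen by a READ_COMMITTED consumer is a commit marker (which only holds when there are no aborted transactions, as checked above), the transaction sizes fall out of the consumed offsets directly. A sketch:

```python
def infer_transaction_sizes(consumed_offsets):
    """Split a READ_COMMITTED offset sequence at each gap, assuming every
    gap is a commit marker; returns the record count per transaction.
    The final group may belong to a still-open or truncated transaction."""
    sizes = []
    run = 1
    for prev, cur in zip(consumed_offsets, consumed_offsets[1:]):
        if cur == prev + 1:
            run += 1          # same consecutive run => same transaction
        else:
            sizes.append(run)  # gap => previous transaction committed here
            run = 1
    sizes.append(run)
    return sizes

observed = [0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14,
            16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 27]
print(infer_transaction_sizes(observed))  # [6, 8, 10, 1]
```

This of course only confirms the hypothesis is self-consistent, not that the broker actually wrote markers at those offsets.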
