Odd behaviour on repartitioning existing data in Kafka

Question

We have the following situation:

Existing topic with 9 partitions in Kafka contains multiple record types. These are partitioned according to a custom header (key = null) which is basically a string UUID.
Data is consumed via Kafka Streams, filtered by the type that interests us and repartitioned into a new topic containing only specific record types. The new topic contains 12 partitions and has key=<original id in header>. The increased partition count is to allow more consumers to process this data.

This is where things seem to get a little weird.

In the original topic, we have millions of the relevant records. In each of the 9 partitions, we see relatively monotonically increasing record times, which is to be expected as the partitions should be assigned relatively randomly due to the high cardinality of the partition key.

In the new topic, we're seeing something like the following:

Seemingly the record timestamps are jumping all over the place. Some discrepancies are to be expected seeing how the partitioning in the original (as well as the new) topic isn't exactly round-robin. We're seeing a few partitions in our original topic which have offsets that are ~1-2M higher/lower than others, but seeing how we have many millions of records of ingest daily, I can't explain the one record with a time stamp of 5/28/2022 between 6/17/2022 and 6/14/2022.

What could explain this behaviour?

Edit:

Looking at the consumer group offsets, I've found something interesting:

I was reingesting the data with multiple consumers and noted that they have severely different lags per partition. I don't quite understand why this discrepancy would be so large. Going to investigate further...

Edit:

To add some more detail, the workflow of the Streams app is as follows:

SpecificAvroSerde<MyEvent> specificAvroSerde = new SpecificAvroSerde<>();
specificAvroSerde.configure(Collections.singletonMap(AbstractKafkaSchemaSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, SCHEMA_REGISTRY_URL), /*isKey*/ false);

streamsBuilder
        .stream("events", Consumed.with(Serdes.Void(), Serdes.ByteArray()))
        .transform(new FilterByTypeHeaderTransformerSupplier(topicProperties))
        .transform(new MyEventAvroTransformerSupplier())
        .to(topicProperties.getOutputTopic(), Produced.with(Serdes.UUID(), specificAvroSerde));

where the FilterByTypeHeaderTransformerSupplier instantiates a transformer that does, in essence:

public KeyValue<Void, byte[]> transform(Void key, byte[] value) {
    // checks record headers
    if (matchesFilter()) {
        return KeyValue.pair(key, value);
    }
    // skip since it is not an event type that interests us
    return null;
}

while the other transformer does the following (which doesn't have great performance but does the job for now):

public KeyValue<UUID, MyAvroEvent> transform(Void key, byte[] value) {
    MyEvent event = objectMapper.readValue(value, MyEvent.class);
    MyAvroEvent avroRecord = serializeAsAvro(event);
    return KeyValue.pair(event.getEventId(), avroRecord);
}

hence I use the default timestamp extractor (FailOnInvalidTimestamp).

Most notably, as can be seen, I'm adding a key to this record: however, this key is the same one that was previously used to partition the data (in the existing 9 partitions, however).

I'll try removing this key first to see if the behaviour changes, but I'm kind of doubtful that that's the reason, especially since it's the same partition key value that was used previously.

I still haven't found the reason for the wildly differing consumer offsets, unfortunately. I very much hope that I don't have to have a single consumer reprocess this once to catch up, since that would take a very long time...

Edit 2:

I believe I've found the cause of this discrepancy. The original records were produced using Spring Cloud Stream, and these records included headers such as "scst_partition=4". However, the producer back then used Java-based hashing (e.g. "keyAsString".hashCode() % numPartitions), while the Kafka clients use:

Utils.toPositive(Utils.murmur2(keyAsBytes)) % numPartitions

As a result, we're seeing behaviour where records in e.g. source partition 0 could land in any one of the new partitions. Hence, small discrepancies in the source distribution could lead to rather large fluctuations in record ordering in the new partitions.

I'm not quite sure how to deal with this in a sensible manner. Currently I've tried using a simple round-robin partitioning in the target topic to see if the distribution is a bit more even in that case.

The reason why this is a problem is that this data will be put on an object storage via e.g. Kafka Connect. If I want this data stored in e.g. a daily format, then old data coming in all the time would cause buffers that should've been closed a long time ago to be kept open, increasing memory consumption. It doesn't make sense to use any kind of windowing for late data in this case, seeing how it's not a real-time aggregation but simply consumption of historical data.

Ideally for the new partitioning I'd want something like: given the number of partitions in the target topic is a multiple of the number of partitions in the source topic, have records in partition 0 go to either partition 0 or 9, from 1 to either 1 or 10, etc. (perhaps even randomly)

This would require some more work in the form of a custom partitioner, but I can't foresee if this would cause other problems down the line.
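A minimal sketch of the mapping described above, assuming the source partition number is known when the record is forwarded and that the target partition count is an exact multiple of the source count. The class and method names are hypothetical, and `spread` is any integer (for example a key hash) used to choose among the candidate target partitions:

```java
/** Hypothetical helper: maps a source partition to one of its candidate target partitions. */
final class MultiplePartitionMapper {

    /**
     * Returns a target partition congruent to sourcePartition modulo sourceCount,
     * so each target partition only ever receives records from one source partition.
     */
    static int targetPartition(int sourcePartition, int sourceCount, int targetCount, int spread) {
        if (sourceCount <= 0 || targetCount <= 0 || targetCount % sourceCount != 0) {
            throw new IllegalArgumentException("targetCount must be a positive multiple of sourceCount");
        }
        if (sourcePartition < 0 || sourcePartition >= sourceCount) {
            throw new IllegalArgumentException("sourcePartition out of range");
        }
        int factor = targetCount / sourceCount;      // e.g. 18 / 9 = 2 candidates per source partition
        int slot = Math.floorMod(spread, factor);    // non-negative even for negative hashes
        return sourcePartition + slot * sourceCount; // e.g. 0 -> 0 or 9, 1 -> 1 or 10
    }

    public static void main(String[] args) {
        System.out.println(targetPartition(0, 9, 18, 4)); // prints 0
        System.out.println(targetPartition(0, 9, 18, 7)); // prints 9
    }
}
```

With 9 source partitions and 12 target partitions this sketch throws, since 12 is not a multiple of 9 and no such one-to-many mapping exists.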

I've also tried setting the partition Id header ("kafka_partitionId" - as far as I know, documentation here isn't quite easy to find) but it is seemingly not used.

I'll investigate a bit further...

Final edit:

For what it's worth, the problem boiled down to the following two issues:

My original data, written by Spring Cloud Stream, was partitioned differently than how a vanilla Kafka Producer (which Kafka Streams uses internally) would partition it. This led to data jumping all over the place from a "record-time" point of view.
Due to the above, I had to choose a number of partitions that was a multiple of the previous number of partitions, as well as use a custom partitioner that partitions the "Spring Cloud Stream" way.

The requirement that the new number be a multiple of the previous one is a result of modular arithmetic. If I wished to have deterministic partitioning for my existing data, having a multiple would allow data to go into one of two possible new partitions as opposed to only one as in the previous case.

E.g. with 9 -> 18 partitions:

id 1 -> previously hashed to partition 0, now hashes to either 0 or 9 (mod 18)
id 2 -> previously hashed to partition 1, now hashes to either 1 or 10 (mod 18)

Hence my requirement for higher parallelism is met and the data inside a single partition is ordered as desired, since a target partition is only supplied from at most one source partition.
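The modular-arithmetic argument can be checked with a small sketch. It assumes the legacy producer chose partitions as the absolute value of `key.hashCode() % numPartitions` on the string form of the id. The question only states `hashCode() % numPartitions`, so the handling of negative hashes here is an assumption, and the class name is hypothetical:

```java
import java.util.UUID;

final class HashCodePartitionCheck {

    /** Assumed legacy scheme: Java hashCode of the string key, reduced modulo the partition count. */
    static int partition(String key, int numPartitions) {
        // |x % n| < n, so Math.abs cannot overflow here
        return Math.abs(key.hashCode() % numPartitions);
    }

    /** True if the key's partition among targetCount maps back to its old partition among sourceCount. */
    static boolean staysAligned(String key, int sourceCount, int targetCount) {
        return partition(key, targetCount) % sourceCount == partition(key, sourceCount);
    }

    public static void main(String[] args) {
        int misaligned18 = 0;
        int misaligned12 = 0;
        for (int i = 0; i < 100_000; i++) {
            String key = UUID.randomUUID().toString();
            if (!staysAligned(key, 9, 18)) misaligned18++;
            if (!staysAligned(key, 9, 12)) misaligned12++;
        }
        // 18 is a multiple of 9, so the old partition is always recoverable from the new one
        System.out.println("9 -> 18 misaligned keys: " + misaligned18);
        // 12 is not a multiple of 9, so keys from one source partition scatter across targets
        System.out.println("9 -> 12 misaligned keys: " + misaligned12);
    }
}
```

In a Streams topology, a function like `partition` could presumably be wrapped in a `StreamPartitioner` and supplied through `Produced.withStreamPartitioner(...)`; that wiring is not shown here.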

I'm sure there might have been a simpler way to go about this all, but this works for now.

For further context/info, see also this Q&A.

To clarify, is your Streams app using a TimestampExtractor? Or is the producer explicitly setting the record timestamp based on a field in the event? That's the only reason why the timestamps would be mixed up (but you shouldn't use this to decipher record order, anyway) — OneCricketeer, Jul 29 '22 at 18:36
@OneCricketeer Thanks for the comment. I'm not using a custom timestamp extractor, meaning `FailOnInvalidTimestamp` is used which in turn uses the record timestamp. I've updated the question with some more details which might prove helpful... — filpa, Aug 01 '22 at 08:34
I see you have `Produced.with(Serdes.UUID()`. It's possible your UUID's only hash into a handful of partitions, and are not evenly distributed. In other words, two or more UUID's may have a partition collision after hashed — OneCricketeer, Aug 01 '22 at 17:56
@OneCricketeer Thanks for the hint. It seems that it is indeed related. I've updated the question with more details, if you're able to take a look. — filpa, Aug 02 '22 at 11:26
"hashing the was used for the producer back then" - I'm not sure about this, but I believe Kafka/Spring always used murmur2 hashes during partitioning... Regarding your kafka connect comment - yes, S3 sink for example, can easily write into older partitions. Topic partitions don't matter for that, though if you're using the Connector time based partitioner — OneCricketeer, Aug 02 '22 at 13:29
*Kafka/Spring always used murmur2* Unfortunately it seems like this is not the case for Spring Cloud Stream, see my question on this [here](https://stackoverflow.com/questions/73207602/why-does-spring-cloud-stream-with-kafka-binder-hash-keys-differently-than-a-stan?noredirect=1#comment129293663_73207602). So, it unfortunately seems that records produced through Spring Cloud Stream cannot be co-partitioned with 'normally' produced records, by default. :( — filpa, Aug 02 '22 at 14:38
I'm not quite sure I understood your S3 sink comment. Since my new records are all over the place, I'm seeing connect logs saying something like `writing 04/21/22`, `writing 04/25/22`, `writing 05/01/22`, `writing 04/28/22`, etc. (in e.g., that order, whenever record belonging to a specific storage partition are encountered). Would this not lead to more memory consumption over time, as old messages are encountered? I am indeed using a `DailyPartitioner` for writing the data into S3. — filpa, Aug 02 '22 at 14:38
AFAIK, the Spring-Cloud-Stream Kafka Binder does use Kafka's Partitioner. It is other binders that may be configured with other partitioners (I could be wrong, as I am not a Spring dev like Gary there)... Yes, there is memory usage per partition-buffer, but no more than `flush.size` or if you are using the scheduled rotation interval, then that is the longest it'll hold onto data. But again, the timestamp is irrelevant to the topic partition — OneCricketeer, Aug 02 '22 at 14:55
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/246973/discussion-between-filpa-and-onecricketeer). — filpa, Aug 02 '22 at 15:15
The Kafka topic partition doesn't determine the S3 sink timestamp partition. Your producers are simply sending late data and/or overriding the record timestamp. You cannot fix this from the consumer side, as S3 sink doesn't know there will ever be "late data" and it never "closes" a timed partition... If you really want to know what data is being sent to S3, if late, and needs re-processed, you can use a lambda job to trigger only S3 prefixes less than "today" — OneCricketeer, Aug 03 '22 at 22:37
Got it, makes sense - thanks for your efforts. If you'd like to write a succinct summary as an answer, I'll gladly accept it. Otherwise I'll write an answer in a couple of days so that others who may have a similar issue might get some help in the future. — filpa, Aug 11 '22 at 15:16
