Repartitioning a high-volume topic in Kafka Streams can be very expensive. One solution is to partition the topic by a key on the producer's side and ingest the already-partitioned topic in the Streams app.
Is there a way to tell the Kafka Streams DSL that my source topic is already partitioned by the given key, so that no repartitioning is needed?
Let me clarify my question. Suppose I have a simple aggregation like this (details omitted for brevity):
builder
    .stream("messages")
    .groupBy((key, msg) -> msg.field)
    .count();
Given this code, Kafka Streams would read the messages topic and immediately write the records back to an internal repartition topic, this time partitioned by msg.field as the key.
One simple way to make this round-trip unnecessary is to write the original messages topic partitioned by msg.field in the first place. But Kafka Streams knows nothing about the partitioning of the messages topic, and I've found no way to tell it how the topic is partitioned without triggering an actual repartition.
Note that I'm not trying to eliminate the partitioning step completely, as the topic has to be partitioned to compute keyed aggregations. I just want to shift the partitioning step upstream, from the Kafka Streams application to the original topic producers.
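To illustrate what "partitioning upstream" means: the partition of a record is a deterministic function of its key, so if producers hash msg.field consistently, all records with the same field value land in the same partition and a downstream keyed aggregation needs no shuffle. This is a minimal sketch of the principle; note that Kafka's actual default partitioner hashes the serialized key with murmur2, while this stand-in uses String.hashCode() purely for illustration, and the class and method names are hypothetical:

```java
public class PartitionSketch {
    // Stand-in for Kafka's default partitioner: a deterministic hash of
    // msg.field modulo the partition count. Kafka really uses murmur2 over
    // the serialized key bytes; String.hashCode() here is illustrative only.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 6;
        // Two records sharing the same msg.field value always land in the
        // same partition, so a keyed aggregation can run partition-locally.
        System.out.println(
                partitionFor("user-42", numPartitions)
                        == partitionFor("user-42", numPartitions));
    }
}
```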
What I'm looking for is basically something like this:
builder
    .stream("messages")
    .assumeGroupedBy((key, msg) -> msg.field)
    .count();
where assumeGroupedBy would mark the stream as already partitioned by msg.field. I understand this solution is somewhat fragile and would break on a partitioning-key mismatch, but it solves one of the problems of processing really large volumes of data.