
Repartitioning a high-volume topic in Kafka Streams can be very expensive. One solution is to partition the topic by a key on the producer's side and ingest the already-partitioned topic in the Streams app.

Is there a way to tell Kafka Streams DSL that my source topic is already partitioned by the given key and no repartition is needed?


Let me clarify my question. Suppose I have a simple aggregation like that (details omitted for brevity):

builder
    .stream("messages")
    .groupBy((key, msg) -> msg.field)
    .count();

Given this code, Kafka Streams would read the messages topic and immediately write the messages back to an internal repartition topic, this time keyed by msg.field.

One simple way to make this round-trip unnecessary is to write the original messages topic partitioned by msg.field in the first place. But Kafka Streams knows nothing about the messages topic's partitioning, and I've found no way to tell it how the topic is partitioned without causing an actual repartition.
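To illustrate the producer-side keying I have in mind, here is a minimal sketch. The `Msg` type and its payload accessor are hypothetical stand-ins for the real message class; the point is only that setting msg.field as the record key lets Kafka's default partitioner co-locate all messages with the same field value in one partition of the messages topic:

```java
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProduce {
    // Hypothetical message type; a real app would use its own class.
    record Msg(String field, String payload) {}

    // Key the record by msg.field so the default partitioner routes all
    // messages with the same field value to the same partition of "messages".
    static ProducerRecord<String, String> toRecord(Msg msg) {
        return new ProducerRecord<>("messages", msg.field(), msg.payload());
    }
}
```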

Note that I'm not trying to eliminate the partitioning step completely as the topic has to be partitioned to compute keyed aggregations. I just want to shift the partitioning step upstream from the Kafka Streams application to the original topic producers.

What I'm looking for is basically something like this:

builder
    .stream("messages")
    .assumeGroupedBy((key, msg) -> msg.field)
    .count();

where assumeGroupedBy would mark the stream as already partitioned by msg.field. I understand this solution is somewhat fragile and would break on a partitioning key mismatch, but it solves one of the problems of processing really large volumes of data.

Boris Sukhinin

  • Thanks for updating your question, Boris. Have you checked the `groupByKey()` function (instead of `groupBy()`, which _always_ causes repartitioning of its input data)? It assumes that the input data is already partitioned as needed as per the existing message key. In your example, `groupByKey()` would work if `key == msg.field`. – miguno Dec 07 '20 at 19:08
  • I've missed the most obvious solution. What a shame! Somehow I assumed that `groupByKey` requires setting the key by calling `selectKey` or `groupBy` beforehand. Wish documentation explicitly stated that this is not necessary. @MichaelG.Noll may I kindly ask you to update your answer so I could mark it as accepted? – Boris Sukhinin Dec 08 '20 at 15:56
  • No worries, it is easy to miss! I updated my answer. – miguno Dec 09 '20 at 10:35
  • Regarding docs confusion: any suggestion on how we can improve it? https://kafka.apache.org/documentation/streams/developer-guide/dsl-api.html says today: "Causes data re-partitioning if and only if the stream was marked for re-partitioning. groupByKey is preferable to groupBy because it re-partitions data only if the stream was already marked for re-partitioning. However, groupByKey does not allow you to modify the key or key type like groupBy does." – miguno Dec 09 '20 at 10:37
  • I think my misunderstanding was because of this `groupBy()` description: "groupBy is a shorthand for selectKey(...).groupByKey()". I thought of a case when you have a modular topology with some parts that could be turned on or off (e.g. in config), and instead of having multiple repartitions with `[.groupBy().doSmth1()][groupBy().doSmth2()]` you could write `.selectKey()[.groupByKey().doSmth1][.groupByKey().doSmth2]`. But it's probably completely on my part. – Boris Sukhinin Dec 09 '20 at 11:05
  • Regarding the docs, does this make sense? "Causes data re-partitioning if and only if the stream was marked for re-partitioning. groupByKey is preferable to groupBy because it re-partitions data only if the stream was already marked for re-partitioning, _otherwise it assumes that the input data is already partitioned as needed as per the existing message key. If you simply want to aggregate the data without incurring a repartitioning operation, then all you need is to use groupByKey()_. However, groupByKey does not allow you to modify the key or key type like groupBy does." @MichaelG.Noll – Boris Sukhinin Dec 09 '20 at 11:06

1 Answer


Update after question was updated: If your data is already partitioned as needed, and you simply want to aggregate the data without incurring a repartitioning operation (both are true for your use case), then all you need to do is use groupByKey() instead of groupBy(). Whereas groupBy() always results in repartitioning, its sibling groupByKey() assumes that the input data is already partitioned as needed as per the existing message key. In your example, groupByKey() would work if key == msg.field.
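As a sketch of what that looks like (the serdes and topic name are illustrative; this assumes the messages topic is already keyed by msg.field at produce time), you can verify from the built topology that groupByKey() introduces no internal repartition topic:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;

public class GroupByKeyExample {
    // Builds the aggregation assuming "messages" is already keyed by
    // msg.field on the producer side; groupByKey() reuses the existing key,
    // so Kafka Streams creates no "-repartition" topic.
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        builder
            .stream("messages", Consumed.with(Serdes.String(), Serdes.String()))
            .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
            .count();
        return builder.build();
    }

    public static void main(String[] args) {
        // Printing the topology shows only the source, the aggregation node,
        // and its state store -- no repartition topic.
        System.out.println(build().describe());
    }
}
```

Had groupBy() been used instead, the described topology would contain an additional sink/source pair for an internal `...-repartition` topic.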

Original answer below:

Repartitioning a high-volume topic in Kafka Streams could be very expensive.

Yes, that's right: it can be very expensive (e.g., when high volume means millions of events per second).

Is there a way to tell Kafka Streams DSL that my source topic is already partitioned by the given key and no repartition is needed?

Kafka Streams does not repartition the data unless you instruct it to; e.g., with the KStream#groupBy() function. Hence there is no need to tell it "not to partition" as you say in your question.

One solution is to partition the topic by a key on the producer's side and ingest the already-partitioned topic in the Streams app.

Given this workaround idea of yours, my impression is that your motivation for asking is something else (you must have a specific situation in mind), but your question text does not make it clear what that could be. Perhaps you need to update your question with more details?

miguno

  • Yes, the question is about processing topics with millions of events per second. I've updated my question and tried to better explain what I'm trying to achieve. – Boris Sukhinin Dec 07 '20 at 18:06