
Say I have a stream of employees, keyed by empId, whose values also include a departmentId. I want to aggregate by department, so I do a selectKey(mapper to get departmentId), then groupByKey() (or, I assume, I could just do a groupBy(...)), and then, say, count(). What exactly happens? I gather that it does a "repartition". I think it writes to an "internal" topic, which is just a regular topic with a derived name, created automatically. That topic is shared by all instances of the stream application, not local to one instance, so the aggregation covers all records with the new key, not just the messages read by one source stream instance (I think). Is that correct?

I've not found a comprehensive description of repartitioning. Can anybody point me to a good article on this?

mconner
  • I don't know where you heard the term "re-partitioning". In my opinion, partitions store the actual messages. Therefore re-partitioning sounds scary to me. – JR ibkr Mar 07 '19 at 20:36
  • The input of Kafka Streams is a topic. A producer pushes data to the partitions of a topic, and sends messages with the same key to the same partition. Now, the Streams API gives you a data store to hold the results of your operations. https://kafka.apache.org/20/documentation/streams/developer-guide/interactive-queries.html – JR ibkr Mar 07 '19 at 20:41
  • @JRibkr: For starters, "repartition" is mentioned 88 times in [KStream javadoc](https://kafka.apache.org/20/javadoc/org/apache/kafka/streams/kstream/KStream.html). I assume I've got the gist of it, but I haven't seen any detailed description, and the scope of the "internal" topic might be open to interpretation. Also, your link points to interactive queries, which is not what I'm talking about. – mconner Mar 07 '19 at 21:19
  • You are correct. Interesting. This opens another door for me. I am reading about re-partitioning. Will update you once I get it. – JR ibkr Mar 07 '19 at 21:36
  • There is a brief description of repartitioning at https://kafka.apache.org/20/javadoc/org/apache/kafka/streams/kstream/KStream.html#join-org.apache.kafka.streams.kstream.KStream-org.apache.kafka.streams.kstream.ValueJoiner-org.apache.kafka.streams.kstream.JoinWindows- . Kafka will create a repartition topic for stream processing, named ${applicationId}-XXX-repartition. You can also configure how long you want to keep that internal topic. – JR ibkr Mar 07 '19 at 21:42
  • Yes, that's from the link I gave you. I was hoping for a little more detail: a clarification of the scope of the internal topic (I presume global, not local, based on the javadoc), maybe a diagram. – mconner Mar 07 '19 at 22:00
  • I don't know what information you need my friend. The document clearly says what's re-partitioning. If you really want to know what happens then read https://github.com/apache/kafka – JR ibkr Mar 07 '19 at 22:39
  • I like to get my assumptions confirmed. This is a fairly powerful feature, and the documentation, which goes into a lot of detail about partitions, doesn't really seem to say much about how repartitioning works. That said, I just found a bit under: [Managing Streams Application Topics](https://kafka.apache.org/21/documentation/streams/developer-guide/manage-topics), which confirms that the topics are "only used by that stream application". – mconner Mar 07 '19 at 23:26

1 Answer


What you describe is exactly what is happening.

A repartition step is the same as a through() that is auto-inserted into the processing topology; through() is itself a shortcut for to("topic") plus builder.stream("topic").
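To make the effect of that write-then-read step concrete, here is a minimal plain-Java simulation. It is a sketch, not Kafka Streams code: the class `RepartitionSketch`, the `Employee` type, and the `hashCode`-based `partitionFor` are all hypothetical stand-ins (Kafka's default partitioner hashes the serialized key differently). It only illustrates the point from the question: once records are routed by the new key, every record for a given department lands in the same partition, so the task reading that partition sees the complete count regardless of which instance read the source record.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of a repartition step; none of these names are Kafka Streams API.
class RepartitionSketch {

    // Hypothetical record: the source stream is keyed by empId, the value carries departmentId.
    static final class Employee {
        final String empId;
        final String departmentId;

        Employee(String empId, String departmentId) {
            this.empId = empId;
            this.departmentId = departmentId;
        }
    }

    // Assumed stand-in for a key-hash partitioner: same key always maps to the same partition.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Simulates selectKey(departmentId) followed by the write to the shared repartition topic:
    // every record is routed by its NEW key, whichever instance originally read it.
    static List<List<Employee>> repartitionByDepartment(List<Employee> source, int numPartitions) {
        List<List<Employee>> topic = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            topic.add(new ArrayList<>());
        }
        for (Employee e : source) {
            topic.get(partitionFor(e.departmentId, numPartitions)).add(e);
        }
        return topic;
    }

    // Simulates the downstream task that reads ONE partition back and counts per key.
    static Map<String, Long> countPartition(List<Employee> partition) {
        Map<String, Long> counts = new HashMap<>();
        for (Employee e : partition) {
            counts.merge(e.departmentId, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Employee> source = List.of(
                new Employee("e1", "sales"),
                new Employee("e2", "eng"),
                new Employee("e3", "sales"),
                new Employee("e4", "ops"));
        List<List<Employee>> topic = repartitionByDepartment(source, 3);
        for (int p = 0; p < topic.size(); p++) {
            System.out.println("partition " + p + " -> " + countPartition(topic.get(p)));
        }
    }
}
```

Note that in Kafka Streams 2.6 and later, through() is deprecated in favor of repartition(), which creates and manages the internal topic for you.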

It's also illustrated and explained in this blog post: https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/

Matthias J. Sax
  • Thanks. I had seen that post in my search, but with its focus on resetting, I didn't get far enough into it to see the relevance. Still, for those of us new to this, it would be nice to see a post explicitly on repartitioning: when it happens, its performance implications, and its (slightly surprising) [security implications](https://kafka.apache.org/10/documentation/streams/developer-guide/security.html). – mconner Mar 08 '19 at 14:23