
I am following the Kafka Streams documentation and I am confused about some concepts, which I want to clarify here.

https://kafka.apache.org/23/documentation/streams/developer-guide/dsl-api.html

The documentation says that flatMap takes one record and produces zero, one, or more records, and that you can modify the record keys and values too. It also marks the data for re-partitioning.

Questions:

1) What does re-partitioning mean? Will it re-partition the data for a new topic, to which I am going to write the transformed results, or will it re-partition the data in the same topic from which I started streaming?

2) If the old topic's data is being re-partitioned, does that mean the transformed results are written to that topic too?

For example:

KStream<Long, String> stream = ...;
KStream<String, Integer> transformed = stream.flatMap(
    // Here, we generate two output records for each input record.
    // We also change the key and value types.
    // Example: (345L, "Hello") -> ("HELLO", 1000), ("hello", 9000)
    (key, value) -> {
      List<KeyValue<String, Integer>> result = new LinkedList<>();
      result.add(KeyValue.pair(value.toUpperCase(), 1000));
      result.add(KeyValue.pair(value.toLowerCase(), 9000));
      return result;
    }
);

In this example, it takes one record and generates two records. Does this mean that the topic from which I started streaming will now have 3 records: one with key 345L and two with keys "HELLO" and "hello"? If I write the transformed result to a new topic or a store, what would the state of the old and new topics be? Would both contain all 3 records? I am a novice.

2 Answers


This is a transformed result, so when you read from a topic, you don't change the source topic. However, when you write to another topic, your output sink topic will have the 2 transformed records.
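For example, continuing from the code in the question (the sink topic name and the serdes here are placeholders, not something from the question):

// Write the transformed stream to a separate sink topic; the source topic is
// only ever read from, so it never receives the transformed records.
// (Uses org.apache.kafka.streams.kstream.Produced and
// org.apache.kafka.common.serialization.Serdes.)
transformed.to("output-topic", Produced.with(Serdes.String(), Serdes.Integer()));

After this, "output-topic" holds the two transformed records per input record, while the source topic still holds only the single original record.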

When the documentation says that flatMap marks the data for re-partitioning, it means the result is marked, and the actual re-partitioning happens downstream, when the records are written out or consumed by a key-based operation. It does not re-partition the source topic. Think about why:

If Kafka Streams re-partitioned the source topic while you are continuously reading from it, it would have to re-partition it continuously. That is not a practical option. I hope this clarifies your question.

Piyush Patel
  • Can I store these transformed results in a store, or some kind of table, rather than creating a new topic for them? If yes, what would be the disadvantage of that? If not, why is it not an option? – Parkash kumar Sep 29 '19 at 12:29
  • You can implement such a processor yourself. Look at this question: https://stackoverflow.com/questions/46524930/how-to-process-a-kafka-kstream-and-write-to-database-directly-instead-of-sending (a sketch of the idea follows below). Don't forget to vote if it answers your question. – Piyush Patel Sep 29 '19 at 13:14
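Following up on that linked question, here is a minimal sketch of writing each record to an external store directly instead of a sink topic. saveToDatabase() is a hypothetical helper you would implement yourself:

// foreach() is a terminal operation: each record is consumed from the stream
// and handed to your callback, e.g. to insert it into a database. Note that
// you give up Kafka's delivery guarantees for that external write.
transformed.foreach((key, value) -> saveToDatabase(key, value));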

Re-partitioning in Kafka Streams means that the records are sent to an intermediate topic before a processor, and the processor then reads the records back from that intermediate topic. By sending the records through the intermediate topic, the records are re-partitioned.

This is needed, for example, with join processors. A join processor in Kafka Streams requires that all records with the same key are processed by the same task to ensure correctness. This would no longer hold if an upstream processor modified the keys of the records, as flatMap() does in your example. Besides joins, aggregations also require that all records with the same key are processed by the same task. Re-partitioning does not write anything to the input or output topic of your Streams application, and you usually do not need to care about the intermediate topics.
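As an illustration (a sketch with placeholder topic names, reusing the flatMap() from the question): a key-changing flatMap() followed by a key-based aggregation such as count() makes Kafka Streams insert an internal re-partition topic, which you can see in the topology description:

import java.util.Arrays;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();
KStream<Long, String> stream =
    builder.stream("input-topic", Consumed.with(Serdes.Long(), Serdes.String()));

// flatMap() may change the key, so the result is marked for re-partitioning.
KStream<String, Integer> transformed = stream.flatMap(
    (key, value) -> Arrays.asList(
        KeyValue.pair(value.toUpperCase(), 1000),
        KeyValue.pair(value.toLowerCase(), 9000)));

// count() needs all records with the same key in the same task, so Kafka
// Streams routes the stream through an internal topic ending in "-repartition".
transformed
    .groupByKey(Grouped.with(Serdes.String(), Serdes.Integer()))
    .count();

// The internal re-partition topic shows up in the printed topology.
System.out.println(builder.build().describe());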

However, what you can do is avoid re-partitioning where possible by using the *Values() operators, such as flatMapValues(), when you do not change the key of the records. For example, if you use flatMap() and do not change the keys, the records will nevertheless be re-partitioned even though it is not needed: Kafka Streams cannot know that you did not touch the key unless you use flatMapValues().
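For example, a value-only variant of the question's code (a sketch; it emits the two numbers but keeps the original Long key):

// flatMapValues() cannot change the key, so Kafka Streams knows the existing
// partitioning is still valid and does not mark the stream for re-partitioning.
KStream<Long, Integer> transformedValues =
    stream.flatMapValues(value -> Arrays.asList(1000, 9000));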

Bruno Cadonna