
Which approach is recommended:
1. A single Kafka Streams application consuming from multiple topics
2. Separate Kafka Streams applications, each consuming from a different topic (I've already used this one with no issues)

Is it possible to achieve #1? If yes, what are the implications? And if I use 'EXACTLY_ONCE' settings, what kind of complexities will it bring?

Kafka version: 2.2.0-cp2

kuti

1 Answer


Is it possible to achieve #1 (Single kafka stream consuming from multiple topics)

Yes, you can use StreamsBuilder#stream(Collection<String> topics).
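A minimal sketch of how this looks (the topic names, application id and broker address are placeholders; the exactly-once setting is included only because the question mentions it):

```java
import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class MultiTopicStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "multi-topic-app");    // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Optional: exactly-once processing, as discussed further below
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);

        StreamsBuilder builder = new StreamsBuilder();
        // One source node reading from both topics; records arrive interleaved in a single stream
        KStream<String, String> merged = builder.stream(Arrays.asList("TOPIC1", "TOPIC2"));
        merged.foreach((key, value) -> System.out.println(key + " -> " + value));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```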

If the data that you want to process is spread across multiple topics and those topics together constitute one logical source, then you can use this; but not if you want to process those topics in parallel.

It is like one consumer subscribing to all these topics, which also means one thread consuming all of them. When you call poll(), it returns ConsumerRecords from all the subscribed topics, not just one topic.
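To make the analogy concrete, here is a plain-consumer sketch (topic names, group id and broker address are placeholders):

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MultiTopicConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "multi-topic-group");          // hypothetical group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // One consumer (one thread) subscribed to both topics
            consumer.subscribe(Arrays.asList("TOPIC1", "TOPIC2"));
            while (true) {
                // A single poll() returns records from all subscribed topics
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.topic() + "-" + record.partition() + ": " + record.value());
                }
            }
        }
    }
}
```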

In Kafka Streams, there is a concept called a Topology, which is basically an acyclic graph of sources, processors and sinks. A topology can contain sub-topologies.

Sub-topologies can then be executed as independent stream tasks through parallel threads (Reference)

Since each sub-topology has its own source, which can be a topic, if you want these topics to be processed in parallel you have to break your graph up into sub-topologies.
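As an illustration, a sketch where two source streams that are never joined or merged form two independent sub-topologies (topic and output names are placeholders); Topology#describe() lets you verify how the graph was split:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class SubTopologies {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Two unconnected source nodes -> two sub-topologies, whose tasks can run
        // in parallel on separate stream threads
        builder.stream("TOPIC1", Consumed.with(Serdes.String(), Serdes.String()))
               .mapValues(v -> "parsed-" + v)
               .to("OUTPUT1", Produced.with(Serdes.String(), Serdes.String()));
        builder.stream("TOPIC2", Consumed.with(Serdes.String(), Serdes.String()))
               .mapValues(v -> "parsed-" + v)
               .to("OUTPUT2", Produced.with(Serdes.String(), Serdes.String()));

        Topology topology = builder.build();
        // Prints the sub-topologies so you can see how the graph was split
        System.out.println(topology.describe());

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sub-topology-app");   // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address
        // More stream threads allow the independent tasks to be processed concurrently
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);

        KafkaStreams streams = new KafkaStreams(topology, props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```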

If I use 'EXACTLY_ONCE' settings, what kind of complexities will it bring?

When messages reach the sink processor in a topology, the offsets of its source must be committed, where the source can be a single topic or a collection of topics.

Whether it is multiple topics or one topic, the producer sends the offsets to the transaction, which is basically a Map<TopicPartition, OffsetAndMetadata> that is committed when the messages are produced.
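Kafka Streams does this for you, but as a rough sketch of the underlying producer API (topic names, partitions, offsets, transactional id and group id here are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class TransactionalOffsets {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");       // assumed broker address
        props.put("transactional.id", "my-transactional-id");   // hypothetical transactional id
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();

            producer.send(new ProducerRecord<>("OUTPUT", "key", "value"));

            // Offsets are tracked per TopicPartition, regardless of how many topics they span
            Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
            offsets.put(new TopicPartition("TOPIC1", 0), new OffsetAndMetadata(42L));
            offsets.put(new TopicPartition("TOPIC2", 0), new OffsetAndMetadata(17L));
            producer.sendOffsetsToTransaction(offsets, "my-consumer-group"); // hypothetical group id

            producer.commitTransaction();
        }
    }
}
```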

So, I think it should not introduce any additional complexity whether it is a single topic with 10 partitions or 10 topics with 1 partition each, because offsets are tracked at the TopicPartition level, not at the topic level.

JavaTechnical
  • I started with a Kafka Streams app consuming from TOPIC1 and doing some parsing with `EXACTLY_ONCE` settings, and everything was working fine. But a timeout issue started occurring when I added another topic (TOPIC2) to the same stream. The error is given below: – kuti Jun 18 '20 at 17:53
  • `ERROR org.apache.kafka.streams.processor.internals.StreamTask - task [1_0] Timeout exception caught when initializing transactions for task 1_0. This might happen if the broker is slow to respond, if the network connection to the broker was interrupted, or if similar circumstances arise. You can increase producer parameter `max.block.ms` to increase this timeout.` AND `org.apache.kafka.common.errors.TimeoutException: Timeout expired while initializing transactional state in 60000ms.` – kuti Jun 18 '20 at 17:55
  • @kuti Did you try the workarounds mentioned, like increasing `max.block.ms` or checking the connectivity of your broker? – JavaTechnical Jun 19 '20 at 05:21