
In a Kafka Streams app, an instance only receives messages from an input topic for the partitions that have been assigned to that instance. And since the group.id is derived from the application.id, which is identical across all instances, every instance sees only part of each topic.

This all makes perfect sense of course, and we take advantage of it for the high-throughput data topic. But we would also like to control the streams application by adding topic-wide "control messages" to the input topic. Since all instances need to receive those messages, we would have to send either

  1. one control message per partition (making it necessary for the sender to know about the partitioning scheme, something we would like to avoid)
  2. one control message per key (so every active partition would be getting at least one control message)

Because this is cumbersome for the sender, we are thinking about creating a new topic for control messages that the streams application consumes, in addition to the data topic. But how can we make it so that every partition receives all messages from the control message topic?

According to https://stackoverflow.com/a/55236780/709537, the group id cannot be set for Kafka Streams.

One way to do this would be to create and use a KafkaConsumer in addition to Kafka Streams, which would allow us to set the group id as we like. However, this sounds complex and dirty enough to make us wonder whether there isn't a more straightforward way that we are missing.
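For reference, the workaround we are considering would look roughly like this: a plain KafkaConsumer running alongside the Streams app, given a unique group.id per instance so that no partition balancing happens between instances and each one sees the whole control topic. This is only a sketch; the topic name `control-topic` and the class name are made up, and the code assumes a broker on `localhost:9092`.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.UUID;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ControlTopicListener {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // A unique group.id per instance means no partition assignment is shared
        // across instances: each instance consumes every partition of the topic.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "control-" + UUID.randomUUID());
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("control-topic"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // React to the control message, e.g. flip a flag that the
                    // Streams processing code checks.
                    System.out.println("control: " + record.value());
                }
            }
        }
    }
}
```

The downside is exactly the complexity mentioned above: a second consumer with its own lifecycle, threading, and error handling, living outside the Streams topology.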

Any ideas?

Evgeniy Berezovsky

1 Answer


You can use a global store which sources data from all the partitions.

From the documentation:

Adds a global StateStore to the topology. The StateStore sources its data from all partitions of the provided input topic. There will be exactly one instance of this StateStore per Kafka Streams instance.

The syntax is as follows:

public StreamsBuilder addGlobalStore(StoreBuilder storeBuilder,
                                     String topic,
                                     Consumed consumed,
                                     ProcessorSupplier stateUpdateSupplier)

The last argument is the ProcessorSupplier, whose get() method returns a Processor. The Processor's process() method is invoked for every new message that arrives on the topic.

The global store is per stream instance, so you get all the topic data in every stream instance.

In the process(K key, V value), you can write your processing logic.
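Putting the pieces together, a minimal sketch of such a topology could look like the following. The store name `control-store`, topic name `control-topic`, and the use of String serdes are assumptions for illustration; note that a global store's StoreBuilder must have logging disabled, because the store is restored directly from the input topic.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class ControlStoreTopology {

    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.addGlobalStore(
            Stores.keyValueStoreBuilder(
                    Stores.inMemoryKeyValueStore("control-store"),
                    Serdes.String(), Serdes.String())
                .withLoggingDisabled(), // required for global stores
            "control-topic",
            Consumed.with(Serdes.String(), Serdes.String()),
            () -> new Processor<String, String>() {
                private KeyValueStore<String, String> store;

                @Override
                @SuppressWarnings("unchecked")
                public void init(ProcessorContext context) {
                    store = (KeyValueStore<String, String>) context.getStateStore("control-store");
                }

                @Override
                public void process(String key, String value) {
                    // Only maintain the store here; don't filter or transform
                    // the records (see the comment on KAFKA-8037 below).
                    store.put(key, value);
                }

                @Override
                public void close() {}
            });
        return builder;
    }
}
```

Your regular per-partition processors can then read the latest control state via `context.getStateStore("control-store")`, since the global store is readable from every task.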

A global store can be in-memory or persistent. It does not get a separate changelog topic; the input topic itself serves as its restore log, so even if a streams instance's local data (state) is deleted, the store can be rebuilt from the topic.
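A persistent (RocksDB-backed) variant can be declared with a store builder like the one below; the store name is an example, and logging must again be disabled because global stores restore from the input topic rather than a changelog.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

public class ControlStoreBuilder {

    // RocksDB-backed store: survives restarts via the local state directory;
    // if the local state is lost, it is re-read from the control topic.
    public static StoreBuilder<KeyValueStore<String, String>> persistentControlStore() {
        return Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("control-store"),
                Serdes.String(), Serdes.String())
            .withLoggingDisabled(); // required for global stores
    }
}
```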

JavaTechnical
  • Note that the `Processor` cannot really do any processing. It should only put the data into the state store, and should not filter/modify the data. Cf. https://issues.apache.org/jira/browse/KAFKA-8037 – Matthias J. Sax Jan 03 '21 at 20:57