Why do I have to configure a state store with Kafka Streams

Question

Currently I have the following setup:

StoreBuilder storeBuilder = Stores.keyValueStoreBuilder(
    Stores.persistentKeyValueStore("kafka.topics.table"),
    new SomeKeySerde(),
    new SomeValueSerde());

streamsBuilder.addStateStore(storeBuilder);

final KStream<byte[], SomeClass> requestsStream = streamsBuilder
    .stream("myTopic", Consumed.with(Serdes.ByteArray(), theSerde));
requestsStream
    .filter((key, request) -> Objects.nonNull(request))
    .process(() -> new SomeClassUpdater("kafka.topics.table", maxNumMatches), "kafka.topics.table");

Properties streamsConfiguration = loadConfiguration();
KafkaStreams streams = new KafkaStreams(streamsBuilder.build(), streamsConfiguration);

streams.start();

Why do I need the local state store, given that I'm not doing any other computation with it and the data is also stored in the Kafka changelog? Also, at what point is data written to the local store: is it stored locally and then committed to the changelog?

The problem I'm facing is that I'm storing data locally and, over time, I run into memory problems, especially when it repartitions often, because the old partitions still sit around and fill the memory. So my question is: why do we need persistence with RocksDB, given that:

the data is persisted in the Kafka changelog
the RAM disk is gone anyway when the container is gone.

State stores aren't required in Kafka Streams. Are you accessing a state store in your processor? What happens when you attempt to remove it from your code? — ck1, Dec 07 '19 at 17:40
@ck1 no, I'm not using it in the processor, but I'd still like to know how people use it if the RAM disk is increased. So what is the gain in all this? — adpap, Dec 10 '19 at 08:53
If you don't use it, why do you create the state store and add it to the topology and processor? Note that the last argument of `process()`, which takes a state store name, is optional, so you can call the method with a single argument. Hence, there is no requirement to add a state store via `addStateStore()` either. — Matthias J. Sax, Dec 13 '19 at 05:12

score 3 · Accepted Answer · answered Dec 10 '19 at 09:03

A single thread can run multiple tasks, and the number of tasks is determined by the number of partitions of the input topic. Each task has its own state store, and each state store saves its data to a changelog, which is an internal Kafka topic. Kafka Streams can also keep a replica of a task's state store on another instance (a standby replica), so that the data can be recovered if the task fails.

Without a local copy of the state, when one of your tasks fails it has to go to the internal topic, i.e. the changelog, and fetch the data for its partition, which is a time-consuming job. Hence, maintaining a state store reduces the recovery time after a task failure, because the data can be picked up from a replica of the state store immediately.
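As far as I know, the replicas described above are not on by default: Kafka Streams calls them standby replicas and the setting num.standby.replicas defaults to 0. Here is a minimal sketch of a configuration that turns one on, using plain string keys so it only needs java.util.Properties. The application id, broker address and state directory are placeholder values I made up, not something from the question.

```java
import java.util.Properties;

// Sketch only: builds the Properties that would be passed to new KafkaStreams(topology, props).
class StandbyConfigSketch {

    static Properties buildConfig() {
        Properties props = new Properties();
        // Hypothetical application id and broker address; replace with your own.
        props.put("application.id", "some-class-updater");
        props.put("bootstrap.servers", "localhost:9092");
        // Keep one standby copy of each state store on another instance,
        // so a failed task can resume without replaying the whole changelog.
        props.put("num.standby.replicas", "1");
        // Directory where local (RocksDB) state stores are kept; hypothetical path.
        props.put("state.dir", "/var/lib/kafka-streams");
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildConfig();
        for (String name : props.stringPropertyNames()) {
            System.out.println(name + " = " + props.getProperty(name));
        }
    }
}
```

Standby replicas only help if there is more than one running instance of the application, and each replica costs extra local disk (or memory) on the instance that hosts it, so this is a trade-off rather than a free win.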

Not sure how this answers the question. This answer explains how state stores work, but the question was "why do I need to use a state store", and the answer to that question is: you don't need to use a state store... — Matthias J. Sax, Dec 13 '19 at 05:15
