
We are facing an issue with a Kafka Streams state store that has a standby replica.

We have two topologies (streams): one without any internal state store and one with an internal state store.

The problem is with the topology that has the state store, so I am only giving info about that topology here:

The active and standby tasks of that stream are running in two separate Kubernetes pods on two nodes.

The active task consumes Kafka messages, processes them, and writes them to the state store. When the pod that holds the active task is killed, the standby task is assigned to the changelog topics, but the restore consumer seeks to the EARLIEST offset. Since the changelog topic has thousands of messages, the restore takes a very long time to complete.

I was expecting the latest offset from the previous active task to be available to the standby (now active) task, so that it would not need to replay all the events.

Please correct me, as I have run out of options and don't know where to check further.

I am using kafka-streams version 2.5.0.

Configuration:

These are the configured values for the topology:

    Properties properties = new Properties();
    properties.put(StreamsConfig.APPLICATION_ID_CONFIG, "myService");
    properties.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServer);
    properties.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, ONE_THOUSAND);
    properties.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024L);
    
    properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    properties.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, "exactly_once");
    properties.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
    properties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
    
    properties.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
    properties.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 6000);
    properties.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, THREE); 

Topology:

    topology.addSource(AGENT_ADMIN_SOURCE, Serdes.String().deserializer(), serde.deserializer(), inboundAdminTopic)
            .addSource(MS_SOURCE_MATCH, Serdes.String().deserializer(), kafkaAvroDeserializer(), inboundMatchTopic)
            .addProcessor(MsProcessor.PROCESSOR_NAME,
                    () -> new MsProcessor(msActionManager(), stateStoreDAO()),
                    AGENT_ADMIN_SOURCE, MS_SOURCE_MATCH)
            .addStateStore(agentAdminStateStore(), MsProcessor.PROCESSOR_NAME)          // agent-state-store
            .addStateStore(resourceMatchingStateStore(), MsProcessor.PROCESSOR_NAME)    // service-matches-state-store
            .addSink(...)
            .addSink(...);
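
For context (the comments below confirm this), the store builders referenced above are in-memory stores. They are not shown in the question, so the following is only a hedged sketch of what one of them might look like; the store name is inferred from the changelog topic names in the logs, and the String serdes are placeholders:

    // Hypothetical sketch only -- not the actual code from the question.
    // The store name is inferred from the changelog topic
    // "myService-agent-state-store-changelog"; the String serdes are placeholders.
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.state.KeyValueStore;
    import org.apache.kafka.streams.state.StoreBuilder;
    import org.apache.kafka.streams.state.Stores;

    class StateStores {
        static StoreBuilder<KeyValueStore<String, String>> agentAdminStateStore() {
            return Stores.keyValueStoreBuilder(
                    Stores.inMemoryKeyValueStore("agent-state-store"),  // state lives only on the JVM heap
                    Serdes.String(),
                    Serdes.String());
        }
    }

Because the data of such a store lives only in memory, a task that is closed has nothing left locally and has to rebuild the store from the changelog, which is relevant to the behaviour described below.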
        

// When the service is installed, this pod gets the standby task:

{"@timestamp":"2020-08-31T10:50:17.253+00:00","@version":"1","message":"stream-thread [myService-418e5529-d591-46b3-b1f4-1da74e3926f4-StreamThread-1] partition assignment took 10 ms.\n\tcurrently assigned active tasks: []\n\tcurrently assigned standby tasks: [0_0]\n\trevoked active tasks: []\n\trevoked standby tasks: []\n","logger_name":"org.apache.kafka.streams.processor.internals.StreamThread","thread_name":"myService-418e5529-d591-46b3-b1f4-1da74e3926f4-StreamThread-1","level":"INFO","level_value":20000}

The other pod is killed. The standby task is expected to read from the latest offset in the changelog topic, but it seeks to EARLIEST instead.

{"@timestamp":"2020-08-31T10:50:17.302+00:00","@version":"1","message":"[Consumer clientId=myService-418e5529-d591-46b3-b1f4-1da74e3926f4-StreamThread-1-restore-consumer, groupId=null] Subscribed to partition(s): myService-agent-state-store-changelog-0,myService-service-matches-state-store-changelog-0 ", "logger_name":"org.apache.kafka.clients.consumer.KafkaConsumer","thread_name":"myService-418e5529-d591-46b3-b1f4-1da74e3926f4-StreamThread-1","level":"INFO","level_value":20000}

{"@timestamp":"2020-08-31T10:50:17.306+00:00","@version":"1","message":"[Consumer clientId=myService-418e5529-d591-46b3-b1f4-1da74e3926f4-StreamThread-1-restore-consumer, groupId=null] Seeking to EARLIEST offset of partition myService-agent-state-store-changelog-0","logger_name":"org.apache.kafka.clients.consumer.internals.SubscriptionState","thread_name":"myService-418e5529-d591-46b3-b1f4-1da74e3926f4-StreamThread-1","level":"INFO","level_value":20000}

{"@timestamp":"2020-08-31T10:50:17.306+00:00","@version":"1","message":"[Consumer clientId=myService-418e5529-d591-46b3-b1f4-1da74e3926f4-StreamThread-1-restore-consumer, groupId=null] Seeking to EARLIEST offset of partition myService-service-matches-state-store-changelog-0","logger_name":"org.apache.kafka.clients.consumer.internals.SubscriptionState","thread_name":"myService-418e5529-d591-46b3-b1f4-1da74e3926f4-StreamThread-1","level":"INFO","level_value":20000}

Could you please tell me why this seeks to EARLIEST instead of the latest known offset?

Also, why am I getting hundreds of these messages?

{"@timestamp":"2020-08-31T11:50:39.849+00:00","@version":"1","message":"[Consumer clientId=myService-418e5529-d591-46b3-b1f4-1da74e3926f4-StreamThread-1-restore-consumer, groupId=null] Subscribed to partition(s): myService-agent-state-store-changelog-0, myService-service-matches-state-store-changelog-0 ", "logger_name":"org.apache.kafka.clients.consumer.KafkaConsumer","thread_name":"myService-418e5529-d591-46b3-b1f4-1da74e3926f4-StreamThread-1","level":"INFO","level_value":20000}
  • Changelogs should always be read from the beginning to recreate their state. Are you sure the standby replica has all the data before becoming active? – OneCricketeer Sep 01 '20 at 22:28
  • thanks @OneCricketeer. Actually we delete the changelog topics before installing the service, so there are no changelog topics at startup and the standby replica would be empty. Then the active task in another pod writes the entries to the state stores. I could not confirm (as I don't know how to get that info) whether the standby is getting entries while they are written by the active task. Please let me know if I did not understand your question or if I missed any info. – Mayil Sep 02 '20 at 14:27
  • But your question says "this changelog topic has thousands of messages it takes very long time complete"... But if you delete it, then there would be no data? And the consumer group is gone, and you've set `auto.offset.reset=earliest`, so that would make it start from the beginning as well – OneCricketeer Sep 02 '20 at 16:12
  • sorry @OneCricketeer, I didn't explain it well. 1. No changelog topic exists (manually deleted) before the test. 2. The topology is initialized with one active and one standby task. 3. The processor writes entries to the state stores; now the state stores contain, say, 10000 messages. 4. Now I kill the pod. The backup task comes up in the 2nd pod, but it reads data from the earliest offset, not from offset 10000. 5. Shouldn't it read from offset 10000, as that is the offset set by the previous active task when it wrote those entries? As I am new to Kafka I am struggling to understand. – Mayil Sep 02 '20 at 18:13
  • Is '"client.id"' required and to be set in topology? – Mayil Sep 02 '20 at 19:45
  • No, a client ID is not required. I wonder if this talk (or similar ones) will help https://speaking.gamov.io/kMSOTR – OneCricketeer Sep 02 '20 at 21:25
  • @Mayil The reason for having a standby task is to avoid rebuilding the state from a changelog topic. But for this to happen, you should have two pods in the beginning, so that the standby task already takes care of keeping the standby copy of the state store up to date with the changelog topic. See: https://docs.confluent.io/current/streams/developer-guide/config-streams.html#num-standby-replicas So if you have a single pod, then I don't think a standby replica will be created. Hence you might be seeing that it's being restored from `earliest` – Dhiraj Sep 03 '20 at 02:03
  • @Dhiraj Actually there are two pods at startup: //first pod {"@timestamp":"2020-08-31T10:50:17.236+00:00","message":"stream-thread [myService-dad55d68-e511-4fe3-857a-dc45deada238-StreamThread-1]...currently assigned active tasks: [0_0] currently assigned standby tasks: []} //second pod where the standby task is initialised {"@timestamp":"2020-08-31T10:50:17.253+00:00","message":"stream-thread [myService-418e5529-d591-46b3-b1f4-1da74e3926f4-StreamThread-1]...currently assigned active tasks: [] currently assigned standby tasks: [0_0]} //Please note that the thread names are not the same. – Mayil Sep 03 '20 at 10:01
  • With the same code, it works fine when we use RocksDB (PersistentKeyValueStore) instead of InMemoryKeyValueStore. With RocksDB, when the standby becomes active it starts from the latest offset instead of EARLIEST. – Mayil Sep 04 '20 at 17:21
  • @Mayil that explains it. InMemoryKeyValueStore, as the name suggests, is not persisted, so it would need to consume the changelog topic from the beginning. – Dhiraj Sep 04 '20 at 23:16

1 Answer


Your expectation is correct, and what you observe is a bug that was fixed in the 2.6.0 release: https://issues.apache.org/jira/browse/KAFKA-9501

The default behavior of Kafka Streams in a rebalance is to close all tasks and reopen them later. However, for standby tasks with an in-memory store, this implies that the state is lost when a standby is closed in order to re-open it as an active task. For persistent stores there is no such problem, because, well, the state is persisted to disk.

Since 2.6.0, the logic was changed and standby tasks are converted to active tasks without being closed, so the state is no longer lost for in-memory stores.
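
If upgrading to 2.6.0 is not immediately possible, the workaround that surfaced in the comments is to back the store with RocksDB instead of keeping it in memory, so the state survives the task being closed during the rebalance. A minimal sketch, using the same hypothetical store name and placeholder serdes as in the question:

    // Sketch of the 2.5.0 workaround: a persistent (RocksDB-backed) store instead of an in-memory one.
    // Store name and serdes are placeholders; adapt them to the real key/value types.
    StoreBuilder<KeyValueStore<String, String>> agentAdminStateStore =
            Stores.keyValueStoreBuilder(
                    Stores.persistentKeyValueStore("agent-state-store"),  // persisted to local disk under state.dir
                    Serdes.String(),
                    Serdes.String());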

Matthias J. Sax