
We are using Spring Cloud Stream with Kafka 2.0.1 and utilizing the InteractiveQueryService to fetch data from the stores. There are 4 stores that persist data on disk after aggregating it. The code for the topology looks like this:

@Slf4j
@EnableBinding(SensorMeasurementBinding.class)
public class Consumer {

  public static final String RETENTION_MS = "retention.ms";
  public static final String CLEANUP_POLICY = "cleanup.policy";

  @Value("${windowstore.retention.ms}")
  private String retention;

  /**
   * Process the data flowing in from a Kafka topic. Aggregate the data into windows of:
   * - 2 minutes
   * - 15 minutes
   * - one hour
   * - 12 hours
   *
   * @param stream the incoming stream of sensor measurements
   */
  @StreamListener(SensorMeasurementBinding.ERROR_SCORE_IN)
  public void process(KStream<String, SensorMeasurement> stream) {

    Map<String, String> topicConfig = new HashMap<>();
    topicConfig.put(RETENTION_MS, retention);
    topicConfig.put(CLEANUP_POLICY, "delete");

    log.info("Changelog and local window store retention.ms: {} and cleanup.policy: {}",
            topicConfig.get(RETENTION_MS),
            topicConfig.get(CLEANUP_POLICY));

    createWindowStore(LocalStore.TWO_MINUTES_STORE, topicConfig, stream);
    createWindowStore(LocalStore.FIFTEEN_MINUTES_STORE, topicConfig, stream);
    createWindowStore(LocalStore.ONE_HOUR_STORE, topicConfig, stream);
    createWindowStore(LocalStore.TWELVE_HOURS_STORE, topicConfig, stream);
  }

  private void createWindowStore(
        LocalStore localStore,
        Map<String, String> topicConfig,
        KStream<String, SensorMeasurement> stream) {

    // Configure how the state store should be materialized using the provided store name
    Materialized<String, ErrorScore, WindowStore<Bytes, byte[]>> materialized = Materialized
            .as(localStore.getStoreName());

    // Set retention of changelog topic
    materialized.withLoggingEnabled(topicConfig);

    // Configure what the windows look like and how long data will be retained in the local stores
    TimeWindows configuredTimeWindows = getConfiguredTimeWindows(
            localStore.getTimeUnit(), Long.parseLong(topicConfig.get(RETENTION_MS)));

    // Processing description:
    // The input data are 'samples' with key <installationId>:<assetId>:<modelInstanceId>:<algorithmName>
    // 1. With the map we add the Tag to the key and we extract the error score from the data
    // 2. With the groupByKey we group the data on the new key
    // 3. With windowedBy we split up the data in time intervals depending on the provided LocalStore enum
    // 4. With reduce we determine the maximum value in the time window
    // 5. With Materialized the result is stored in a local (windowed) state store and its changelog topic
    stream
            .map(getInstallationAssetModelAlgorithmTagKeyMapper())
            .groupByKey()
            .windowedBy(configuredTimeWindows)
            .reduce(this::getMaxErrorScore, materialized);
  }

  private TimeWindows getConfiguredTimeWindows(long windowSizeMs, long retentionMs) {
    // of() sets the window size; chain until() so the retention is applied to the returned instance
    return TimeWindows.of(windowSizeMs).until(retentionMs);
  }

  /**
   * Determine the max error score to keep by comparing the aggregated error signal with the
   * freshly consumed error signal.
   *
   * @param aggValue the error score aggregated so far
   * @param newValue the newly consumed error score
   * @return the error score with the highest error signal
   */
  private ErrorScore getMaxErrorScore(ErrorScore aggValue, ErrorScore newValue) {
    if (aggValue.getErrorSignal() > newValue.getErrorSignal()) {
        return aggValue;
    }
    return newValue;
  }

  private KeyValueMapper<String, SensorMeasurement,
        KeyValue<? extends String, ? extends ErrorScore>> getInstallationAssetModelAlgorithmTagKeyMapper() {
    return (s, sensorMeasurement) -> new KeyValue<>(s + "::" + sensorMeasurement.getT(),
            new ErrorScore(sensorMeasurement.getTs(), sensorMeasurement.getE(), sensorMeasurement.getO()));
  }
}

So we are materializing aggregated data to four different stores after determining the max value within a specific window for a specific key. Note that the retention is set to two months of data and the cleanup policy is delete; we don't compact data.

The size of the individual state stores on disk is between 14 and 20 GB.

We are making use of Interactive Queries: https://docs.confluent.io/current/streams/developer-guide/interactive-queries.html#interactive-queries
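
For context, fetching data out of one of the window stores looks roughly like the sketch below. The class and method names are illustrative only; it assumes the binder's InteractiveQueryService plus the ErrorScore and LocalStore types from the topology above, and it only returns the data held by the local instance:

import java.util.ArrayList;
import java.util.List;

import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyWindowStore;
import org.apache.kafka.streams.state.WindowStoreIterator;
import org.springframework.cloud.stream.binder.kafka.streams.InteractiveQueryService;
import org.springframework.stereotype.Component;

@Component
public class ErrorScoreQueries {

  private final InteractiveQueryService interactiveQueryService;

  public ErrorScoreQueries(InteractiveQueryService interactiveQueryService) {
    this.interactiveQueryService = interactiveQueryService;
  }

  /** Fetch the max error scores for one key between two timestamps (epoch millis). */
  public List<ErrorScore> fetchTwoMinuteScores(String key, long fromMs, long toMs) {
    ReadOnlyWindowStore<String, ErrorScore> store = interactiveQueryService.getQueryableStore(
            LocalStore.TWO_MINUTES_STORE.getStoreName(),
            QueryableStoreTypes.<String, ErrorScore>windowStore());

    List<ErrorScore> scores = new ArrayList<>();
    // fetch() returns one entry per window that contains data for the key within [fromMs, toMs]
    try (WindowStoreIterator<ErrorScore> iterator = store.fetch(key, fromMs, toMs)) {
      iterator.forEachRemaining(entry -> scores.add(entry.value));
    }
    return scores;
  }
}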

In our setup we have 4 instances of our streaming app that together form one consumer group, so every instance stores a specific part of all the data in its local stores.
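
Since each instance only holds its share of the keys, a query for a key that lives on another instance has to be routed to that instance. A minimal sketch of the lookup (class name illustrative; it assumes application.server is set per instance, as in the config further below, and leaves the actual HTTP forwarding out):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.state.HostInfo;
import org.springframework.cloud.stream.binder.kafka.streams.InteractiveQueryService;
import org.springframework.stereotype.Component;

@Component
public class StoreHostLookup {

  private final InteractiveQueryService interactiveQueryService;

  public StoreHostLookup(InteractiveQueryService interactiveQueryService) {
    this.interactiveQueryService = interactiveQueryService;
  }

  /** Returns host:port of the instance that currently holds the given key for the given store. */
  public String ownerOf(String storeName, String key) {
    HostInfo owner = interactiveQueryService.getHostInfo(storeName, key, Serdes.String().serializer());
    // Compare against this instance's application.server to decide whether to serve the query
    // locally or forward it over HTTP to the owning instance.
    return owner.host() + ":" + owner.port();
  }
}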

This all seems to work nicely, until we restart one or more instances and wait for them to become available again. I would expect the restart of the app not to take that long, but unfortunately it takes up to 1 hour. I guess the issue is caused by the amount of data in combination with restoring the state stores, but I'm not sure. Since we persist the state store data on persistent volumes outside of the container that runs on Kubernetes, I would have expected the app to receive the last offset from the broker and only continue from that point, as the previously consumed data is already there in the state store. Unfortunately I don't have a clue how to resolve this.

Restarting our app triggers a restore task:

-StreamThread-2] Restoring task 4_3's state store twelve-hours-error-score from beginning of the changelog anomaly-timeline-twelve-hours-error-score-changelog-3. 

This process takes quite a while. Why is it restoring from the beginning, and why does it take so long? I do have auto.offset.reset set to "earliest", but that is only used when the offset is unknown, isn't it?
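
To quantify what is being restored, one option is to register a StateRestoreListener on the KafkaStreams instance (e.g. via KafkaStreams#setGlobalStateRestoreListener; how to obtain the KafkaStreams instance depends on the Spring wiring). A minimal sketch, with an arbitrary class name and logger:

import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.streams.processor.StateRestoreListener;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Logs the offset range each changelog partition restores on startup, which makes it
 * obvious whether a store resumes from a checkpoint or is rebuilt from offset 0.
 */
public class LoggingStateRestoreListener implements StateRestoreListener {

  private static final Logger LOG = LoggerFactory.getLogger(LoggingStateRestoreListener.class);

  @Override
  public void onRestoreStart(TopicPartition topicPartition, String storeName,
                             long startingOffset, long endingOffset) {
    LOG.info("Restoring store {} from changelog {}: offsets {} to {} ({} records to restore)",
            storeName, topicPartition, startingOffset, endingOffset, endingOffset - startingOffset);
  }

  @Override
  public void onBatchRestored(TopicPartition topicPartition, String storeName,
                              long batchEndOffset, long numRestored) {
    LOG.debug("Restored a batch of {} records for store {} ({}), now at offset {}",
            numRestored, storeName, topicPartition, batchEndOffset);
  }

  @Override
  public void onRestoreEnd(TopicPartition topicPartition, String storeName, long totalRestored) {
    LOG.info("Finished restoring store {} ({}), {} records restored in total",
            storeName, topicPartition, totalRestored);
  }
}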

Here are my stream settings. Note that cache.max.bytes.buffering is set to 0. I changed this, but that didn't make a difference. I also read about a bug with num.stream.threads where a value > 1 causes issues, but setting it to 1 doesn't improve restart speed either.

2019-03-05 13:44:53,360 INFO main org.apache.kafka.common.config.AbstractConfig StreamsConfig values: 
application.id = anomaly-timeline
application.server = localhost:5000
bootstrap.servers = [localhost:9095]
buffered.records.per.partition = 1000
cache.max.bytes.buffering = 0
client.id = 
commit.interval.ms = 500
connections.max.idle.ms = 540000
default.deserialization.exception.handler = class org.apache.kafka.streams.errors.LogAndFailExceptionHandler
default.key.serde = class org.apache.kafka.common.serialization.Serdes$StringSerde
default.production.exception.handler = class org.apache.kafka.streams.errors.DefaultProductionExceptionHandler
default.timestamp.extractor = class errorscore.raw.boundary.ErrorScoreTimestampExtractor
default.value.serde = class errorscore.raw.boundary.ErrorScoreSerde
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
num.standby.replicas = 1
num.stream.threads = 2
partition.grouper = class org.apache.kafka.streams.processor.DefaultPartitionGrouper
poll.ms = 100
processing.guarantee = at_least_once
receive.buffer.bytes = 32768
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
replication.factor = 1
request.timeout.ms = 40000
retries = 0
retry.backoff.ms = 100
rocksdb.config.setter = null
security.protocol = PLAINTEXT
send.buffer.bytes = 131072
state.cleanup.delay.ms = 600000
state.dir = ./state-store
topology.optimization = none
upgrade.from = null
windowstore.changelog.additional.retention.ms = 86400000

It also logs these messages after a while:

CleanupThread] Deleting obsolete state directory 1_1 for task 1_1 as 1188421ms has elapsed (cleanup delay is 600000ms).

Also something to note: I added the following code to override the default cleanup on start and stop, which by default deletes the stores:

@Bean
public CleanupConfig cleanupConfig() {
    // Do not call KafkaStreams#cleanUp() on start or on stop, so local state survives restarts
    return new CleanupConfig(false, false);
}

Any help would be appreciated!

mmelsen
  • Not sure atm. How long is your application down? Did you try to increase the cleanup delay (config `state.cleanup.delay.ms`)? I was just checking the code, there might be a race condition that may lead to an incorrect cleanup of local state on startup... Increasing the cleanup delay should help to avoid the race condition. – Matthias J. Sax Mar 09 '19 at 06:50
  • It's just down for a couple of minutes, so if it would restart it would have to find the old state and continue. I will try to increase that setting. However the default is already 6000000 milliseconds so that is already very long isn't it? Does having rocksdb caching disabled, as is the default, also have impact on this? Is there also some log level that I could set on certain packages in order to find out what is going on? I tried setting trace on org.kafka.streams but that didn't really give me something to work with – mmelsen Mar 09 '19 at 10:43
  • Caching should be independent. And the default value is 600,000 (i.e., 10 minutes), not 6,000,000 -- if you say a couple of minutes, you might exceed this? Can you double check the directory timestamp? On `#start()` we register a cleaner background thread: https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/KafkaStreams.java#L797-L803 -- also, what happens first -- the restoring? Or the cleanup? Also, what version are you using? – Matthias J. Sax Mar 09 '19 at 18:42
  • I got the default from https://kafka.apache.org/10/documentation/streams/developer-guide/config-streams.html#optional-configuration-parameters where it says: 6000000. Nevertheless the downtime is just 2 or 3 minutes. What should I see about the timestamp? What do you want to show me about the cleaner? The fact that it uses that property? I'll have to check the logs fit what happened first. I'm using Kafka 2.0.1 and confluent Kafka brokers 5.1.0 – mmelsen Mar 09 '19 at 19:15
  • The doc is wrong (not sure why, this part is generated from the sources...). But you can see from the printed logs that it's 10 minutes: `state.cleanup.delay.ms = 600000`. And yes, the cleaner checks the "last modified" timestamp of the task directory to see if it's subject to deletion (if it's not in use -- first, the cleaner tries to get a lock that it would not get if the task is active). -- I am confused about the versions. Confluent 5.1.0 ships with AK 2.1.0 -- so your brokers are on 2.1, but your Streams app is using 2.0.1? – Matthias J. Sax Mar 09 '19 at 19:27
  • I'm using streams 2.1.0.Thanks for clarifying that cleanup property. Some things I see in the logs: 14:51:10,522 -StreamThread-1 o.a.k.c.c.i.AbstractCoordinator$HeartbeatResponseHandler [Consumer clientId=at..] Attempt to heartbeat failed since group is rebalancing 14:51:10,593 -StreamThread-1 o.a.k.c.c.i.ConsumerCoordinator [Consumer clientId=at-StreamThread-1-consumer, ] Revoking previously assigned partitions [... – mmelsen Mar 09 '19 at 19:59
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/189733/discussion-between-mmelsen-and-matthias-j-sax). – mmelsen Mar 09 '19 at 19:59
  • I have tried to get to the bottom of this but did not succeed. Anyone can help me out? – mmelsen Mar 13 '19 at 18:34
  • @MatthiasJ.Sax I've implemented a StateResoreListener: public void onRestoreStart(...) { long toRestore = end - start; totalToRestore.put(topicPartition, toRestore); LOG.info("Starting restoration for {} on topic-partition {} total to restore {}", store, topicPartition, toRestore); } This is invoked for same partition over and over: 2019-03-22 09:35:16,03 INFO atl-737e402a-c5bb-446e-8dad-b81e918d4047-StreamThread-1 c.w.a.a.L Starting restoration for fifteen-min on topic-partition atl-fifteen-min-changelog-2 total to restore 119673932 – mmelsen Mar 22 '19 at 09:43
  • Hard to say. Maybe open a bug report ticket and attach all information you could collect, including DEBUG logs? I don't think this can be resolved on SO. – Matthias J. Sax Mar 22 '19 at 17:42
  • Previously we faced an issue with long app restarts with Kafka Streams (without Spring Cloud Stream) when we had aggregations with persistent state stores and logging enabled (`withLoggingEnabled = true`). After we decided to disable logging and switch to an in-memory state store, app restarts became much, much faster. Here you need to understand that you might potentially lose data on a sudden app crash, but for some use cases that's ok. – Vasyl Sarzhynskyi Apr 11 '19 at 06:43
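
For reference, the logging-disabled variant suggested in the last comment would only change the Materialized setup in createWindowStore(). A minimal sketch (trade-off: without a changelog topic the store contents cannot be rebuilt from Kafka if the local state is lost):

    // Same store name as before, but no changelog topic is created or written to
    Materialized<String, ErrorScore, WindowStore<Bytes, byte[]>> materialized = Materialized
            .<String, ErrorScore, WindowStore<Bytes, byte[]>>as(localStore.getStoreName())
            .withLoggingDisabled();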

1 Answer


We think we solved the issue. The different instances each got their own persistent volume. When restarting the instances, it seems that some (or sometimes all) instances got linked to other persistent volumes instead of the ones they were previously using. This caused the state stores to become obsolete and the restoration process to kick in. We solved this by using NFS to share the persistent volumes so that all instances point to the same state store directory structure. This seems to solve the issue.

mmelsen