If I use a persistent store when materializing a KTable, will the state store be persistent across application restarts? For example, if I use the following:

StreamsBuilder builder = new StreamsBuilder();
KeyValueBytesStoreSupplier storeSupplier = Stores.persistentKeyValueStore("queryable-store-name");
KTable<Long, String> table = builder.table(
    "foo",
    Materialized.<Long, String>as(storeSupplier)
                .withKeySerde(Serdes.Long())
                .withValueSerde(Serdes.String()));

Will the state store "queryable-store-name" be accessible with state from previous runs after a restart? Let's say I send 50 records to topic foo and they get materialized in the state store. If the application is then restarted, will I still have those 50 records in the state store? If not, is there a way to achieve that?

Thanks!

sobychacko

1 Answer

Yes, the state store is persisted on disk by default. If the application is restarted and its application.id has not changed, the state is recovered from disk and contains all 50 records. Processing then resumes from the offset where the application left off when it was killed, stopped, or restarted, so new records are added on top of the recovered state.
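
As a minimal sketch of the settings this depends on: the class and method names below are hypothetical, and the broker address, application id, and directory are placeholder values. The three key strings are the standard Kafka Streams configuration names. Local state is kept under the state.dir location in a subdirectory named after the application.id, so both should stay the same across restarts.

```java
import java.util.Properties;

// Hypothetical helper class, for illustration only.
class StreamsConfigSketch {

    // Collects the settings that decide whether a restart finds the old state.
    // The keys are Kafka Streams config names; the values are placeholders.
    static Properties buildConfig(String applicationId, String stateDir) {
        Properties props = new Properties();
        // Must stay identical across restarts, otherwise the app starts with fresh state.
        props.put("application.id", applicationId);
        // Assumed local broker address.
        props.put("bootstrap.servers", "localhost:9092");
        // Directory where the persistent stores keep their local files.
        props.put("state.dir", stateDir);
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildConfig("foo-table-app", "/var/lib/kafka-streams");
        System.out.println(props.getProperty("application.id"));
        System.out.println(props.getProperty("state.dir"));
    }
}
```

The resulting Properties object would be passed to the KafkaStreams constructor together with the built topology. Pointing state.dir at a location that survives reboots is a choice made here for illustration, not something the question requires.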

Edit: It seems you're missing an aggregation operation on top of the KTable; this is required. See my code example:

final KStream<CustomerKey, ViewPage> viewPagesStream = builder.stream(INPUT_TOPIC);

final KTable<Windowed<ViewPageCountKey>, Long> uniqueViewPageCount = viewPagesStream
        .map((key, value) -> {
            ViewPageCountKey newKey = new ViewPageCountKey(key.getProjectId(), value.getUrl());
            return new KeyValue<>(newKey, value);
        })
        .filter((key, value) -> key != null)
        .groupByKey()
        .count(TimeWindows.of(WINDOW_SIZE).advanceBy(WINDOW_ADVANCE), STORE_NAME);
Matus Cimerman
  • Recommended reading: https://docs.confluent.io/current/streams/developer-guide/interactive-queries.html – Matus Cimerman Jul 20 '18 at 12:36
  • Thanks for the answer. After a restart, will I be able to query the state store for the data from the previous run? I don't see that working as you describe, so maybe I am doing something wrong. – sobychacko Jul 20 '18 at 13:14
  • Side remark: even if you use an in-memory store, or lose the state on disk, it will be recovered from the changelog topic before processing resumes. As long as logging is enabled, your store is fully fault-tolerant. Note that a persistent store without logging enabled is not fully fault-tolerant. The idea of a persistent store is to allow state that is larger than main memory, and a quicker startup time because the store does not need to be rebuilt from the changelog topic. However, the local store data on disk is not written for fault-tolerance reasons -- that's the changelog topic's purpose. – Matthias J. Sax Jul 20 '18 at 21:17
  • Thanks @MatthiasJ.Sax. I have a follow-up question here: https://stackoverflow.com/questions/51461416/aggregration-and-state-store-retention-in-kafka-streams. I would greatly appreciate it if you could chime in on that. – sobychacko Jul 22 '18 at 01:50
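
To tie the changelog remark back to the application.id point in the answer, here is a small sketch of how the changelog topic for a store is named. Kafka Streams uses the pattern `<application.id>-<store name>-changelog`; the class name, application id, and store name below are hypothetical. For a table read directly from a topic, Streams may reuse the source topic instead of creating a separate changelog, depending on version and optimization settings, so treat this as the general rule rather than a statement about the store in the question.

```java
// Hypothetical helper class, for illustration only.
class ChangelogTopicName {

    // Returns the changelog topic name following the Kafka Streams convention
    // "<application.id>-<store name>-changelog".
    static String changelogTopic(String applicationId, String storeName) {
        return applicationId + "-" + storeName + "-changelog";
    }

    public static void main(String[] args) {
        // Placeholder application id and store name.
        System.out.println(changelogTopic("foo-table-app", "view-page-counts"));
        // prints "foo-table-app-view-page-counts-changelog"
    }
}
```

Because the application.id is part of the topic name, changing it means a restarted application looks for a different changelog topic and finds no previous state to restore.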