I'm using Kafka and Kafka Streams as part of Spring Cloud Stream. The data flowing through my Kafka Streams app is aggregated and materialized over certain time windows:
Materialized<String, ErrorScore, WindowStore<Bytes, byte[]>> oneHour =
        Materialized.as("one-hour-store");
oneHour.withLoggingEnabled(topicConfig); // back the store with a changelog topic

events
    .map(getStringSensorMeasurementKeyValueKeyValueMapper())
    .groupByKey()
    .windowedBy(TimeWindows.of(oneHourStore.getTimeUnit()))
    .reduce((aggValue, newValue) -> getMaxErrorScore(aggValue, newValue), oneHour);
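For context, ErrorScore is just a small value object and getMaxErrorScore simply keeps whichever of the two scores is higher. A simplified sketch of what I mean (my real class has more fields):

// Simplified value type; the real ErrorScore carries more fields.
public class ErrorScore {
    private final double score;

    public ErrorScore(double score) {
        this.score = score;
    }

    public double getScore() {
        return score;
    }
}

// In the topology class: keep the higher of the two scores per window.
private ErrorScore getMaxErrorScore(ErrorScore aggValue, ErrorScore newValue) {
    return newValue.getScore() > aggValue.getScore() ? newValue : aggValue;
}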
As designed, the materialized information is also backed by a changelog topic.
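The topicConfig passed to withLoggingEnabled above is just a map of topic-level settings applied to that changelog topic. A minimal sketch, with illustrative values rather than my exact config:

// Topic-level overrides for the store's changelog topic (illustrative values).
Map<String, String> topicConfig = new HashMap<>();
topicConfig.put(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_DELETE);
topicConfig.put(TopicConfig.RETENTION_MS_CONFIG, "5259600000"); // ~61 days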
Our app also has a REST endpoint that queries the state store like this:
ReadOnlyWindowStore<String, ErrorScore> windowStore =
        queryableStoreRegistry.getQueryableStoreType("one-hour-store", QueryableStoreTypes.windowStore());
WindowStoreIterator<ErrorScore> iter = windowStore.fetch(key, from, to);
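Stripped down, the endpoint looks roughly like this (the controller name and request mapping are illustrative, not my exact code):

import java.util.ArrayList;
import java.util.List;

import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyWindowStore;
import org.apache.kafka.streams.state.WindowStoreIterator;
import org.springframework.cloud.stream.binder.kafka.streams.QueryableStoreRegistry;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ErrorScoreController {

    private final QueryableStoreRegistry queryableStoreRegistry;

    public ErrorScoreController(QueryableStoreRegistry queryableStoreRegistry) {
        this.queryableStoreRegistry = queryableStoreRegistry;
    }

    @GetMapping("/error-scores/{key}")
    public List<ErrorScore> errorScores(@PathVariable String key,
                                        @RequestParam long from,
                                        @RequestParam long to) {
        ReadOnlyWindowStore<String, ErrorScore> windowStore =
                queryableStoreRegistry.getQueryableStoreType(
                        "one-hour-store", QueryableStoreTypes.windowStore());

        // Collect all window values for the key within [from, to].
        List<ErrorScore> result = new ArrayList<>();
        try (WindowStoreIterator<ErrorScore> iter = windowStore.fetch(key, from, to)) {
            while (iter.hasNext()) {
                result.add(iter.next().value);
            }
        }
        return result;
    }
}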
Looking at the settings of the changelog topic that is created, it reads:
min.insync.replicas 1
cleanup.policy delete
retention.ms 5259600000
retention.bytes -1
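(For reference, this is the kind of AdminClient snippet that can dump those settings; the broker address and the application.id prefix in the topic name are placeholders, not my actual values:)

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class ChangelogConfigDump {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Changelog topics are named <application.id>-<store name>-changelog.
            ConfigResource topic = new ConfigResource(
                    ConfigResource.Type.TOPIC, "my-app-one-hour-store-changelog");
            Config config = admin.describeConfigs(Collections.singleton(topic))
                    .all().get().get(topic);
            config.entries().forEach(e ->
                    System.out.println(e.name() + " " + e.value()));
        }
    }
}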
Given the retention.ms of 5259600000 (5259600000 / 86400000 ≈ 61 days, i.e. about 2 months), I would assume that the local state store keeps the information for at least 61 days. However, it seems that only about the last day of data remains in the stores.
What could cause the data being removed so soon?
Update with solution: Kafka Streams 2.0.1 does not contain the Materialized.withRetention method. For this particular version I was able to set the retention time of the state stores using the following code, which solves my problem:
// until() sets how long the windowed state store retains data
TimeWindows timeWindows = TimeWindows.of(windowSizeMs).until(retentionMs);
making my code look like:
...
    .groupByKey()
    .windowedBy(timeWindows)
    .reduce((aggValue, newValue) -> getMaxErrorScore(aggValue, newValue), oneHour);
...
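For completeness: on Kafka Streams 2.1 and later the same retention could be set directly on the store via Materialized.withRetention. A sketch of that variant (not what I run on 2.0.1, where the method is missing; Duration is java.time.Duration):

Materialized<String, ErrorScore, WindowStore<Bytes, byte[]>> oneHour =
        Materialized.<String, ErrorScore, WindowStore<Bytes, byte[]>>as("one-hour-store")
                .withLoggingEnabled(topicConfig)
                .withRetention(Duration.ofDays(61)); // keep windowed state ~61 days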