I'm trying to understand how the internals of Kafka Streams work with respect to caching and RocksDB (the state store).
KTable<Windowed<EligibilityKey>, String> kTable = kStreamMapValues
        // group records by key, using the Avro serde for the key and a String serde for the value
        .groupByKey(Grouped.with(keySpecificAvroSerde, Serdes.String()))
        .windowedBy(timeWindows)
        // keep only the latest value seen within each window
        .reduce((a, b) -> b,
                materialized.withLoggingDisabled().withRetention(Duration.ofSeconds(retention)))
        // hold back intermediate updates for the configured time before emitting downstream
        .suppress(Suppressed.untilTimeLimit(Duration.ofSeconds(timeToWaitForMoreEvents),
                Suppressed.BufferConfig.unbounded().withLoggingDisabled()));
With the above portion of my topology, I'm consuming from a Kafka topic with 300 partitions. The application is deployed on OpenShift with a memory limit of 4 GB. I noticed the application's memory usage constantly increasing until eventually the pod was OOMKilled. After some research I read that I should provide a custom RocksDB configuration, because the default settings reserve too much memory for my application. My understanding is that records first enter the Kafka Streams record cache (controlled by CACHE_MAX_BYTES_BUFFERING_CONFIG and COMMIT_INTERVAL_MS_CONFIG) and are then flushed into the state store (RocksDB).
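For context, those two configs are set in my streams properties roughly like this (a minimal sketch; the application id, bootstrap servers, and the 10 MB / 30 s values are placeholders, not my real settings):

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "eligibility-app");      // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // placeholder
// Kafka Streams record cache (on-heap), flushed at least once per commit interval
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024L);
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 30_000L);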
import java.util.Map;

import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Options;

public class BoundedMemoryRocksDBConfig implements RocksDBConfigSetter {

    // Static, so a single Cache and WriteBufferManager are created per application instance
    private static org.rocksdb.Cache cache = new org.rocksdb.LRUCache(1 * 1024 * 1024L, -1, false, 0);
    private static org.rocksdb.WriteBufferManager writeBufferManager = new org.rocksdb.WriteBufferManager(1 * 1024 * 1024L, cache);

    @Override
    public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
        BlockBasedTableConfig tableConfig = (BlockBasedTableConfig) options.tableFormatConfig();

        // These three options in combination limit the memory used by RocksDB to the size passed to the block cache (1 MB here)
        tableConfig.setBlockCache(cache);
        tableConfig.setCacheIndexAndFilterBlocks(true);
        options.setWriteBufferManager(writeBufferManager);

        // These options are recommended when bounding the total memory
        tableConfig.setCacheIndexAndFilterBlocksWithHighPriority(true);
        tableConfig.setPinTopLevelIndexAndFilter(true);
        tableConfig.setBlockSize(2048L);
        options.setMaxWriteBufferNumber(2);
        options.setWriteBufferSize(1 * 1024 * 1024L);

        options.setTableFormatConfig(tableConfig);
    }

    @Override
    public void close(final String storeName, final Options options) {
        // The Cache and WriteBufferManager are not closed here, as the same objects are shared by every store instance.
    }
}
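For completeness, the config setter is registered via the streams properties (a minimal sketch; ROCKSDB_CONFIG_SETTER_CLASS_CONFIG is the standard StreamsConfig constant and props is the same Properties object passed to the KafkaStreams constructor):

props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, BoundedMemoryRocksDBConfig.class);
KafkaStreams streams = new KafkaStreams(topology, props);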
For each time window store, three segments are created by default. Since I'm consuming from 300 partitions, and each partition's window store gets 3 segments, 900 RocksDB instances are created. Is my understanding correct that the following holds?
Memory allocated in OpenShift / number of RocksDB instances => 4096 MB / 900 => ~4.55 MB per instance
(WriteBufferSize * MaxWriteBufferNumber) + BlockCache + WriteBufferManager => (1 MB * 2) + 1 MB + 1 MB => 4 MB
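Written out as a quick sanity check (a sketch only, reusing my assumptions of 300 partitions, 3 segments per store, a 4096 MB container limit, and the 1 MB sizes from the config above):

long partitions = 300;
long segmentsPerStore = 3;
long rocksDbInstances = partitions * segmentsPerStore;       // 900
double budgetPerInstanceMb = 4096.0 / rocksDbInstances;      // ~4.55 MB of the container limit per instance
long estimatePerInstanceMb = (1 * 2) + 1 + 1;                // write buffers + block cache + write buffer manager = 4 MB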
Is the memory bound defined in BoundedMemoryRocksDBConfig.java applied per RocksDB instance, or shared across all of them?