I'm using Kafka Streams 3.0.0, Spring Cloud Stream, and Micrometer to surface the metrics at /actuator/prometheus. The metric I'm referring to is kafka_stream_state_block_cache_capacity, which I believe is equivalent to the block-cache-capacity metric from this Confluent document.
I came across this Medium article, which mentions that the RocksDB config setter class is executed once per StreamThread. I also came across a Confluent document that says that by declaring the cache static, the memory usage across all instances can be bounded.
In my setup, one application instance handles more than one Kafka topic partition. However, from the metric, I see that the different partitions are assigned to the same StreamThread, and the reported memory is multiplied by the number of partitions the application handles.
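To illustrate why I expected one cache per JVM rather than per task: a static field is initialized once per class load, so every thread (and therefore every StreamThread and task) should see the same instance. Here is a minimal, dependency-free sketch where SharedCacheHolder stands in for the config setter class and CACHE stands in for the static LRUCache (both names are mine, not from Kafka Streams):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SharedCacheHolder {
    // Initialized exactly once per JVM, like the static LRUCache below.
    static final Object CACHE = new Object();

    public static void main(String[] args) throws InterruptedException {
        Map<String, Object> seenPerThread = new ConcurrentHashMap<>();
        Runnable task = () -> seenPerThread.put(Thread.currentThread().getName(), CACHE);

        // Two threads standing in for two StreamThreads.
        Thread t1 = new Thread(task, "StreamThread-1");
        Thread t2 = new Thread(task, "StreamThread-2");
        t1.start(); t2.start();
        t1.join(); t2.join();

        // Both threads observed the exact same object instance.
        Object a = seenPerThread.get("StreamThread-1");
        Object b = seenPerThread.get("StreamThread-2");
        System.out.println("same instance: " + (a == b)); // prints true
    }
}
```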
For example, if my application handles two Kafka partitions, and my RocksDB config is as shown below:
import java.util.Map;

import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;

public class BoundedMemoryRocksDBConfig implements RocksDBConfigSetter {

  static {
    RocksDB.loadLibrary();
  }

  private static long lruCacheBytes = 100L * 1024L * 1024L; // 100 MB
  private static long memtableBytes = 1024L * 1024L;
  private static int nMemtables = 1;
  private static long writeBufferManagerBytes = 95L * 1024L * 1024L;

  private static org.rocksdb.Cache cache = new org.rocksdb.LRUCache(lruCacheBytes);
  private static org.rocksdb.WriteBufferManager writeBufferManager =
      new org.rocksdb.WriteBufferManager(writeBufferManagerBytes, cache);

  @Override
  public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
    BlockBasedTableConfig tableConfig = (BlockBasedTableConfig) options.tableFormatConfig();

    tableConfig.setBlockCache(cache);
    tableConfig.setCacheIndexAndFilterBlocks(true);
    options.setWriteBufferManager(writeBufferManager);

    // These options are recommended to be set when bounding the total memory
    tableConfig.setCacheIndexAndFilterBlocksWithHighPriority(true);
    tableConfig.setPinTopLevelIndexAndFilter(true);
    options.setMaxWriteBufferNumber(nMemtables);
    options.setWriteBufferSize(memtableBytes);
    options.setTableFormatConfig(tableConfig);
  }

  @Override
  public void close(final String storeName, final Options options) {}
}
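For completeness, the setter class only takes effect once it is registered under the Streams property rocksdb.config.setter (the literal value of StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG). A minimal sketch of how I register it, where the package name com.example is a placeholder for my actual package:

```java
import java.util.Properties;

public class RocksDbSetterRegistration {
    public static void main(String[] args) {
        Properties props = new Properties();
        // "rocksdb.config.setter" is the value of
        // StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG.
        props.put("rocksdb.config.setter",
                  "com.example.BoundedMemoryRocksDBConfig");
        System.out.println(props.getProperty("rocksdb.config.setter"));
    }
}
```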
I can see entries like these in /actuator/prometheus:
kafka_stream_state_block_cache_capacity{kafka_version="3.0.0",rocksdb_window_state_id="one.minute.window.count",spring_id="stream-builder-process",task_id="0_0",thread_id="7ed0af6a-244f-4b87-b4cf-f2f311df976c-StreamThread-1",} 2.097152E8
kafka_stream_state_block_cache_capacity{kafka_version="3.0.0",rocksdb_window_state_id="one.minute.window.count",spring_id="stream-builder-process",task_id="0_1",thread_id="7ed0af6a-244f-4b87-b4cf-f2f311df976c-StreamThread-1",} 2.097152E8
The entries above show two tasks (because the application handles two Kafka topic partitions), with each task reporting ~200MB.
My understanding is that each task reports ~200MB because the static cache is set to 100MB and, based on this StackOverflow answer, two segments are created per task, where each segment relates to a state store; therefore ~100MB * 2 = ~200MB.
Also, since there are two entries for the kafka_stream_state_block_cache_capacity metric, one per task, it means that my application uses a total of ~200MB * 2 = ~400MB.
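To make that arithmetic concrete, here is the calculation restated as a sketch. Note this only reproduces what the metric suggests; whether the bytes are physically duplicated, rather than one shared cache being reported once per segment and per task, is exactly what I'm unsure about:

```java
public class CacheCapacityArithmetic {
    public static void main(String[] args) {
        long cacheBytes = 100L * 1024 * 1024; // static LRUCache capacity, 100 MB
        int segmentsPerTask = 2;              // per the linked StackOverflow answer
        int tasks = 2;                        // two partitions -> two tasks

        long perTask = cacheBytes * segmentsPerTask; // matches the 2.097152E8 per task
        long reportedTotal = perTask * tasks;        // total across both entries

        System.out.println(perTask);        // prints 209715200
        System.out.println(reportedTotal);  // prints 419430400
    }
}
```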
Is my understanding of the memory allocation correct, and is the static cache allocated per partition rather than per StreamThread?