I'm trying to understand how the internals of Kafka Streams work with respect to the cache and RocksDB (the state store).

        KTable<Windowed<EligibilityKey>, String> kTable = kStreamMapValues
                .groupByKey(Grouped.with(keySpecificAvroSerde, Serdes.String())).windowedBy(timeWindows)
                .reduce((a, b) -> b, materialized.withLoggingDisabled().withRetention(Duration.ofSeconds(retention)))
                .suppress(Suppressed.untilTimeLimit(Duration.ofSeconds(timeToWaitForMoreEvents),
                        Suppressed.BufferConfig.unbounded().withLoggingDisabled()));
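
For completeness, the supporting objects referenced above are built roughly like this (a sketch; the window size of 60 seconds and the store name are placeholders, not my real values):

import java.time.Duration;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.state.WindowStore;

TimeWindows timeWindows = TimeWindows.of(Duration.ofSeconds(60)); // placeholder window size
Materialized<EligibilityKey, String, WindowStore<Bytes, byte[]>> materialized =
        Materialized.as("eligibility-window-store"); // placeholder store name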

With the above portion of my topology, I'm consuming from a Kafka topic with 300 partitions. The application is deployed on OpenShift with a memory allocation of 4 GB. I noticed the application's memory constantly increasing until it eventually gets OOMKilled. After some research I read that I should implement a custom RocksDB config, because the default sizes are too large for my application. Records first enter a cache (configured by CACHE_MAX_BYTES_BUFFERING_CONFIG and COMMIT_INTERVAL_MS_CONFIG) and are then written to the state store.
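
For reference, a minimal sketch of how those two properties are set (the values here are placeholders, not my actual configuration):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
// Size of the record cache shared by all threads of this Streams client.
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024L);
// The cache is flushed to the state stores (and downstream) at the latest on every commit.
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 30000L);

The custom RocksDB config I implemented is below.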

import java.util.Map;

import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Options;

public class BoundedMemoryRocksDBConfig implements RocksDBConfigSetter {

  private static org.rocksdb.Cache cache = new org.rocksdb.LRUCache(1 * 1024 * 1024L, -1, false, 0);
  private static org.rocksdb.WriteBufferManager writeBufferManager = new org.rocksdb.WriteBufferManager(1 * 1024 * 1024L, cache);

  @Override
  public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {

    BlockBasedTableConfig tableConfig = (BlockBasedTableConfig) options.tableFormatConfig();

    // These three options in combination limit the memory used by RocksDB to the size passed to the block cache above
    tableConfig.setBlockCache(cache);
    tableConfig.setCacheIndexAndFilterBlocks(true);
    options.setWriteBufferManager(writeBufferManager);

    // These options are recommended to be set when bounding the total memory
    tableConfig.setCacheIndexAndFilterBlocksWithHighPriority(true);
    tableConfig.setPinTopLevelIndexAndFilter(true);
    tableConfig.setBlockSize(2048L);
    options.setMaxWriteBufferNumber(2);
    options.setWriteBufferSize(1 * 1024 * 1024L);

    options.setTableFormatConfig(tableConfig);
  }

  @Override
  public void close(final String storeName, final Options options) {
    // Cache and WriteBufferManager should not be closed here, as the same objects are shared by every store instance.
  }
}
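
The config setter is then registered through the streams configuration, continuing the property sketch from above:

// Apply BoundedMemoryRocksDBConfig to every RocksDB store opened by this Streams client.
props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, BoundedMemoryRocksDBConfig.class);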

For each windowed state store, three segments are created by default. Since I'm consuming from 300 partitions, and each partition's store gets 3 segments, 900 RocksDB instances are created. Is my understanding correct that the following is true?

 Memory allocated in OpenShift / RocksDB instances => 4096MB / 900 => 4.55 MB
 
 (WriteBufferSize * MaxWriteBufferNumber) + BlockCache + WriteBufferManager => (1MB * 2) + 1MB + 1MB => 4MB

Does the BoundedMemoryRocksDBConfig.java apply to each RocksDB instance individually, or to all of them together?

1 Answer

If you consume from a topic with 300 partitions and you use a segmented state store, i.e., you use a time window in the DSL, you will end up with 900 RocksDB instances. If you only use one Kafka Streams client, i.e., you do not scale out, all 900 RocksDB instances will end up on the same computing node.

The BoundedMemoryRocksDBConfig limits the memory RocksDB uses per Kafka Streams client. That means, if you only use one Kafka Streams client, BoundedMemoryRocksDBConfig limits the memory for all 900 instances.

Is my understanding correct that the following is true?

Memory allocated in OpenShift / RocksDB instances => 4096MB / 900 => 4.55 MB

(WriteBufferSize * MaxWriteBufferNumber) + BlockCache + WriteBufferManager => (1MB * 2) + 1MB + 1MB => 4MB

No, that is not correct.

If you pass the Cache to the WriteBufferManager, the size needed for memtables is also counted against the cache (see footnote 1 in the docs of the BoundedMemoryRocksDBConfig and the RocksDB docs). So, the size that you pass to the cache is the limit for memtables and block cache together. Since you pass the same cache and write buffer manager to all of your instances on the same computing node, all 900 instances are bounded by the size you pass to the cache. For example, if you specify a size of 4 GB, the total memory used by all 900 instances (assuming one Kafka Streams client) is bounded by 4 GB.
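
For example, inside the config setter the shared objects could be sized for the whole client like this (the constant names are only illustrative, not Kafka Streams constants):

// Budget for all RocksDB instances of this Kafka Streams client.
private static final long TOTAL_OFF_HEAP_MEMORY = 4L * 1024 * 1024 * 1024;   // e.g. 4 GB in total
private static final long TOTAL_MEMTABLE_MEMORY = TOTAL_OFF_HEAP_MEMORY / 2; // share reserved for memtables

// Memtable memory is counted against this cache because the cache is passed to the WriteBufferManager.
private static final org.rocksdb.Cache cache =
        new org.rocksdb.LRUCache(TOTAL_OFF_HEAP_MEMORY, -1, false, 0.1);
private static final org.rocksdb.WriteBufferManager writeBufferManager =
        new org.rocksdb.WriteBufferManager(TOTAL_MEMTABLE_MEMORY, cache);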

Be aware that the size passed to the cache is not a strict limit. Although the boolean parameter in the constructor of the cache gives you the option to enforce a strict limit, the enforcement does not work if the write buffer memory is also counted against the cache due to a bug in the RocksDB version Kafka Streams uses.

With Kafka Streams 2.7.0, you will have the possibility to monitor RocksDB memory consumption with metrics that are exposed via JMX. Have a look at KIP-607 for more details.
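
If you want to inspect those metrics from inside the running application, a minimal sketch could look like the following; the kafka.streams JMX domain and the stream-state-metrics type follow the usual Kafka Streams metric naming, but check KIP-607 for the exact metric names:

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public final class RocksDBMetricNamesDump {

  // Prints the object names of all state-store-level metric MBeans registered in this JVM.
  public static void printStateStoreMetricNames() throws Exception {
    final MBeanServer server = ManagementFactory.getPlatformMBeanServer();
    for (final ObjectName name : server.queryNames(new ObjectName("kafka.streams:type=stream-state-metrics,*"), null)) {
      System.out.println(name);
    }
  }
}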

Bruno Cadonna
  • Thanks, Bruno. What happens when all instances hit the cache limit? Do older records get released from the state store and flow to the rest of the topology? I saw something about writing to disk, which is not what I want. – mikeayonguyen Dec 04 '20 at 16:50
  • RocksDB is a persistent key-value store. It will always write everything to disk. It just keeps data structures in memory to improve write and read performance. To not exceed the memory limits, RocksDB needs to write data to disk. – Bruno Cadonna Dec 04 '20 at 17:07
  • Okay. I guess what I'm trying to get at is: if this application is on OpenShift and RocksDB continues writing to disk to avoid exceeding memory limits, eventually the memory allocated to the application will fill up regardless. – mikeayonguyen Dec 04 '20 at 17:14
  • Why would the memory fill up regardless of RocksDB moving data from memory to disk? Is there something special about OpenShift that I do not know? – Bruno Cadonna Dec 04 '20 at 17:20
  • I must be confused. What I meant to say is I have 2 GB of memory assigned to that application. My cache limit is 1 GB (512 MB of which is the WriteBufferManager). Will the application ever write more than the 2 GB I've assigned to it? I don't have persistent storage allocated to the application/pod. – mikeayonguyen Dec 04 '20 at 17:34
  • If you do not have persistent storage allocated to the application/pod, then you cannot use RocksDB, because RocksDB is a persistent key-value store. In other words, RocksDB needs a disk. In Kafka Streams you can also use in-memory state stores, but if they run out of memory your application crashes with an OOM error. – Bruno Cadonna Dec 07 '20 at 09:55
  • Regarding your other question about the application writing more than a given amount of data into the state store: a Kafka Streams application writes as much as it needs to the state store. It has some ways to clean up state stores if that is possible due to the semantics of the operations, e.g., for windowed state stores. In general it is the responsibility of the user to ensure that the state stays within given limits, e.g., by using windowing or by cleaning up state stores manually. – Bruno Cadonna Dec 07 '20 at 11:25
  • I am using Stores.inMemoryWindowStore. I have logging disabled because I cannot create internal topics. I noticed that the local state stores are written to the folder /tmp/kafka-streams/. Say I allocate persistent storage to that directory. Does RocksDB clean up older files in that directory, or does it grow indefinitely? – mikeayonguyen Dec 07 '20 at 14:44
  • If you use Stores.inMemoryWindowStore, you do not use RocksDB at all. – Bruno Cadonna Dec 07 '20 at 16:36
  • I have commented out the portion that passes Stores.inMemoryWindowStore into the Materialized object, so it should be using the RocksDB state store by default. I still have .withRetention() set. Does that mean the local state stores at /tmp/kafka-streams will abide by that and be cleared with respect to the retention set? – mikeayonguyen Dec 07 '20 at 16:48
  • That sounds correct. It will remove the segments (i.e. RocksDB instances) of the state store that contain records older than (stream time) - (retention period). Just to be clear, if you store your state store in a temp folder and disable logging, you will lose your state each time your pod crashes/stops. – Bruno Cadonna Dec 07 '20 at 20:17
  • Even if I have a persistent volume at /tmp/kafka-streams? – mikeayonguyen Dec 08 '20 at 14:38
  • Without logging, the state cannot be recreated locally if your stateful task migrates to another Streams client/computing node. On the same computing node it depends on how your OS cleans up its temp folders. – Bruno Cadonna Dec 08 '20 at 15:06
  • And because the state cannot be recreated, it will take a while to build it up again. This could result in lost records, right, because the consumer offset has already advanced? – mikeayonguyen Dec 08 '20 at 15:27
  • Yes, the old state will be lost. – Bruno Cadonna Dec 08 '20 at 20:19
  • Last remark. If I have a persistent volume claim on /tmp/kafka-streams, and those state stores persist between deployments of the Kafka Streams application, are changelog topics required? From what you said previously, my understanding is that changelog topics are only required to rebuild the state stores, but because I have a persistent volume claim on /tmp/kafka-streams, they are intact after each deployment. – mikeayonguyen Dec 11 '20 at 22:10
  • Do you mean you share the same state directory on all computing nodes? That might lead to issues with locked directories, because each Kafka Streams client will try to clean up state that it does not own, but it cannot know that another Kafka Streams client accesses the same state directory. If you were talking about each computing node having its own state directory, then you need the changelog topic to recreate the state of a task when the task migrates to a computing node that does not have the state. – Bruno Cadonna Dec 14 '20 at 10:59
  • I only have one computing node/Kafka Streams client, so the persistent volume claim is only for that node/client. Whenever I redeploy the pod and check /tmp/kafka-streams, all the state stores are still there and their timestamps are from previous days (meaning the PVC saved them). I keep a close eye on the logs during start-up and there's nothing logged about creating state stores; however, the first log I see pertaining to a state store says it's opening one. That led me to believe it's not creating any from scratch but using the state stores saved in the PVC. – mikeayonguyen Dec 14 '20 at 16:16
  • Not related to your use case, but for future reference, the following statement I made is not correct: "That might lead to issues with locked directories because each Kafka Streams client will try to clean up state that it does not own, but it cannot know that another Kafka Streams client accesses the same state directory." That would not happen, because of locks on the operating system level. Nevertheless, it is not recommended to use the same state directory for multiple Kafka Streams clients on the same computing node, because Kafka Streams assumes that the clients are located on separate nodes. – Bruno Cadonna Dec 16 '20 at 15:38
  • @BrunoCadonna Is my understanding about the max off heap size for rocksdb correct ? https://stackoverflow.com/questions/65814205/kafka-streams-limiting-off-heap-memory – SunilS Jan 20 '21 at 17:16