
On a Kafka broker, it's recommended to use multiple drives for the message logs to improve throughput. That's why the broker has a log.dirs property that accepts multiple directories, which are assigned to partitions in a round-robin fashion.

We already have a lot of installations set up this way for event-driven Kafka applications, typically with 4 nodes with 5 disks each.

Now we want to use Kafka Streams with a key-value store in which we persist computed data for fast range queries. We see that Kafka Streams maps the partitions one-to-one to state stores and creates a separate subdirectory for each one.

However, we can't find a way to spread those subdirectories across different disks. We can only configure a single parent directory as 'state.dir' (StreamsConfig.STATE_DIR_CONFIG).
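To make the limitation concrete, here is a minimal sketch of the configuration in question. It uses plain java.util.Properties with the literal key "state.dir" (the value behind StreamsConfig.STATE_DIR_CONFIG) so that it runs without the Kafka Streams jar. The application id, bootstrap server, and disk path are hypothetical placeholders.

```java
import java.util.Properties;

// Sketch only: builds the kind of Properties object that would be handed to Kafka Streams.
class StateDirExample {

    // "state.dir" is the key behind StreamsConfig.STATE_DIR_CONFIG.
    static final String STATE_DIR = "state.dir";

    static Properties buildConfig(String stateDir) {
        Properties props = new Properties();
        props.put("application.id", "range-query-app");   // hypothetical application id
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        // A single parent directory; every state store subdirectory ends up below it.
        props.put(STATE_DIR, stateDir);
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildConfig("/data/disk1/kafka-streams"); // hypothetical path
        System.out.println(props.getProperty(STATE_DIR));
    }
}
```

Unlike log.dirs on the broker, the value here is one path, so all state store subdirectories share whatever disk backs that path.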

Is there a configuration I am missing? Or is having multiple disks not so relevant for Kafka Streams?

GeertPt

1 Answer


It's not really relevant for Kafka Streams. If you do want to use multiple disks, this must be handled at the OS level, for example via a RAID configuration.

Or you can implement the StateStore interface and write your own provider that can use multiple disks (or remote distributed filesystems).
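As a rough illustration of the second option, the directory-selection part of such a custom provider might look like the sketch below. It is not a StateStore implementation and does not use any Kafka Streams API. The class MultiDiskStoreDirs, its dirFor method, the store name, and the disk paths are all hypothetical names chosen for this example. It only shows one way to map a partition number onto one of several base directories in a round-robin fashion.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper: picks a base directory (one per disk) for each partition.
class MultiDiskStoreDirs {

    private final List<Path> baseDirs;

    MultiDiskStoreDirs(List<Path> baseDirs) {
        if (baseDirs.isEmpty()) {
            throw new IllegalArgumentException("at least one base directory is required");
        }
        this.baseDirs = List.copyOf(baseDirs);
    }

    // Round-robin by partition number: partition p goes to disk (p mod number of disks).
    Path dirFor(String storeName, int partition) {
        Path base = baseDirs.get(Math.floorMod(partition, baseDirs.size()));
        return base.resolve(storeName).resolve(Integer.toString(partition));
    }

    public static void main(String[] args) {
        MultiDiskStoreDirs dirs = new MultiDiskStoreDirs(
                List.of(Paths.get("/data/disk1"), Paths.get("/data/disk2"))); // hypothetical mounts
        for (int partition = 0; partition < 4; partition++) {
            System.out.println(dirs.dirFor("counts", partition)); // "counts" is a made-up store name
        }
    }
}
```

A custom store would still have to open its own storage under the chosen directory and handle restoration itself. This sketch only covers the placement decision.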

OneCricketeer
  • Interesting that you mention remote distributed filesystems. Would it be possible to keep the state in a shared directory, so that when partitions are rebalanced, a node can reuse the data stored for a partition that was previously assigned to a different node? Or would that result in corrupt data, so that I need a different folder per node? – GeertPt Mar 17 '20 at 13:53
  • I'm not really sure... You can look at this as an example https://github.com/andreas-schroeder/redisks – OneCricketeer Mar 18 '20 at 06:57