To separate my data based on a key, should I use multiple topics or multiple partitions within the same topic? I'm asking in terms of overhead, computation, data storage, and load on the server.
2 Answers
I would recommend separating (partitioning) your data into multiple partitions within the same topic, assuming the data logically belongs together (for example, a stream of click events). The main advantage of using multiple partitions within the same topic is that all the Kafka APIs are designed to be used this way.
Splitting your data across multiple topics would probably require much more code in the producer and consumer implementations.
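A minimal sketch of the recommended approach: keyed records co-locating in partitions of a single topic. The topic layout, keys, and hash function here are illustrative stand-ins; Kafka's real default partitioner uses a murmur2 hash of the key modulo the partition count, which this byte-sum hash merely imitates.

```python
# Simulate Kafka's key-based partitioning within a single topic.
# NOTE: the real default partitioner is murmur2(key) % num_partitions;
# the byte-sum below is a deterministic stand-in for illustration only.

def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition; identical keys always co-locate."""
    return sum(key.encode()) % num_partitions

NUM_PARTITIONS = 3  # hypothetical partition count for the example

# Hypothetical click-event stream: (key, value) pairs in production order.
events = [("user-1", "click"), ("user-2", "view"), ("user-1", "purchase")]

partitions: dict[int, list] = {p: [] for p in range(NUM_PARTITIONS)}
for key, value in events:
    partitions[partition_for(key, NUM_PARTITIONS)].append((key, value))

# All records for "user-1" land in the same partition, in send order,
# so a consumer of that partition sees them in order.
```

Because every record with the same key maps to the same partition, per-key ordering is preserved without any extra routing code in the producer or consumer.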

As @rmetzger suggested, splitting records into multiple topics would increase complexity at the producer level; however, there are some other factors worth considering.
In Kafka, the main unit of parallelism is the number of partitions in a topic: you can spawn that many consumer instances to read data from the same topic in parallel.
E.g., if you have a topic with N partitions, you can create N consumer instances, each dedicated to consuming from a specific partition concurrently. But in that case the ordering of messages across the topic is not guaranteed, i.e., total ordering is lost in the presence of parallel consumption.
On the other hand, keeping the records that must stay ordered within the same partition of one topic makes it easy to consume messages in order (Kafka only provides a total order over messages within a partition, not between different partitions in a topic). But you will be limited to running only one consumer process per partition in that case.
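The parallelism-versus-ordering trade-off above can be sketched as follows. The partition contents and consumer model are hypothetical example data, not real Kafka client calls: each simulated consumer drains one partition, so order holds within a partition but any global interleaving is possible.

```python
# Illustrate the trade-off: N partitions allow N parallel consumers,
# but only per-partition ordering is guaranteed, never a total order.
# Partition contents below are hypothetical example messages.

partitions = {
    0: ["a1", "a2", "a3"],  # messages stored in offset order
    1: ["b1", "b2", "b3"],
}

def consume(partition_id: int) -> list[str]:
    """One consumer instance dedicated to one partition."""
    return list(partitions[partition_id])  # offset order preserved

# Each partition is consumed in order (could run in parallel threads/processes).
per_partition = [consume(p) for p in sorted(partitions)]

# One POSSIBLE global interleaving seen by a downstream system; any other
# interleaving of the two ordered streams is equally valid.
interleaved = [msg for pair in zip(*per_partition) for msg in pair]
```

Within each partition the offsets come back in order, but `interleaved` is only one of many valid merge orders, which is exactly why cross-partition ordering cannot be relied upon.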
