I am trying to develop a better understanding of how Kafka works. To keep things simple, currently I am running Kafka on one Zookeeper with 3 brokers and one partition with duplication factor of 3. I learned that, in general, it's better to have number of partitions ~= number of consumers.
Question 1: Do topics share offsets in the same partition?
I have multiple topics (e.g. dogs
, cats
, dinosaurs
) on one partition (e.g. partition 0). Now my producers have produced a message to each of the topics. "msg: bark"
to dogs
, "msg: meow"
to cats
and "msg: rawr"
to dinosaurs
. I noticed that if I specify dogs[0][0]
, I get back bark
and if I do the same on cats
and dinosaurs
, I do get back each message respectively. This is an awesome feature but it contradicts with my understanding. I thought offset is specific to a partition. If I have pushed three messages into a partition sequentially. Shouldn't the messages be indexed with 0, 1, and 2? Now it seems me that offset is specific to a topic.
This is how I imagined it
['bark', 'meow', 'rawr']
In reality, it looks like this
['bark']
['meow']
['rawr']
But that can't be it. There must be something keeping track of offset and the actual physical location of where the message is in the log file.
Question 2: How do you manage your messages if you were to have multiple partitions for one topic?
In question 1, I have multiple topics in one partition, now let's say I have multiple partitions for one topic. For example, I have 4 partitions for the dogs
topic and I have 100 messages to push to my Kafka cluster. Do I distribute the messages evenly across partitions like 25 goes in partition 1, 25 goes in partition 2 and so on...?
If a consumer wants to consume all those 100 messages at once, he/she needs to hit all four partitions. How is this different from hitting 1 partition with 100 messages? Does network bandwidth impose a bottleneck?
Thank you in advance