How does offset work when I have multiple topics on one partition in Kafka?

Question

I am trying to develop a better understanding of how Kafka works. To keep things simple, I am currently running Kafka with one ZooKeeper node and 3 brokers, and one partition with a replication factor of 3. I learned that, in general, it's better to have the number of partitions ~= the number of consumers.

Question 1: Do topics share offsets in the same partition?

I have multiple topics (e.g. dogs, cats, dinosaurs) on one partition (e.g. partition 0). My producers have produced one message to each of the topics: "msg: bark" to dogs, "msg: meow" to cats and "msg: rawr" to dinosaurs. I noticed that if I specify dogs[0][0], I get back bark, and if I do the same on cats and dinosaurs, I get back each topic's message respectively. This is an awesome feature, but it contradicts my understanding. I thought an offset is specific to a partition. If I have pushed three messages into a partition sequentially, shouldn't the messages be indexed with 0, 1, and 2? Now it seems to me that the offset is specific to a topic.

This is how I imagined it

['bark', 'meow', 'rawr']

In reality, it looks like this

['bark']
['meow']
['rawr']

But that can't be it. There must be something keeping track of offset and the actual physical location of where the message is in the log file.

Question 2: How do you manage your messages if you were to have multiple partitions for one topic?

In question 1, I have multiple topics in one partition, now let's say I have multiple partitions for one topic. For example, I have 4 partitions for the dogs topic and I have 100 messages to push to my Kafka cluster. Do I distribute the messages evenly across partitions like 25 goes in partition 1, 25 goes in partition 2 and so on...?

If a consumer wants to consume all those 100 messages at once, he/she needs to hit all four partitions. How is this different from hitting 1 partition with 100 messages? Does network bandwidth impose a bottleneck?

Thank you in advance

score 4 · Accepted Answer · answered Sep 18 '17 at 22:37

For your question 1: It is impossible to have multiple topics on one partition. A partition is conceptually part of a topic, and each partition of a topic has its own sequence of offsets. You can have 3 topics, each with only one partition, so you have 3 partitions in total. That explains the behavior you observed: offset 0 of partition 0 exists separately in each topic.
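To make the point about per-partition offsets concrete, here is a minimal toy model in plain Python. It is not Kafka client code: `ToyBroker` and its methods are hypothetical names invented for this sketch, and it only assumes that every (topic, partition) pair owns a separate append-only log.

```python
from collections import defaultdict


class ToyBroker:
    """Toy model (hypothetical): one append-only log per (topic, partition)."""

    def __init__(self):
        # Key is (topic, partition); value is that partition's list of records.
        self.logs = defaultdict(list)

    def append(self, topic, partition, value):
        log = self.logs[(topic, partition)]
        log.append(value)
        # The offset is simply the record's position within this one log.
        return len(log) - 1

    def read(self, topic, partition, offset):
        return self.logs[(topic, partition)][offset]


broker = ToyBroker()
offsets = {
    topic: broker.append(topic, 0, message)
    for topic, message in [("dogs", "bark"), ("cats", "meow"), ("dinosaurs", "rawr")]
}

# Every topic has its own partition 0, so every first record gets offset 0.
print(offsets)
print(broker.read("dogs", 0, 0))
```

In this model `dogs[0][0]` corresponds to `broker.read("dogs", 0, 0)`: three topics with one partition each means three independent logs, not one shared log with offsets 0, 1 and 2.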

For your question 2: On the producer side, if a valid partition number is specified, that partition will be used when sending the record. If no partition is specified but a key is present, a partition will be chosen using a hash of the key. If neither key nor partition is present, a partition will be assigned in a round-robin fashion (the behavior of the default partitioner when this was written). The number of partitions determines the maximum parallelism. There is a concept called a consumer group, which can have multiple consumers in the same group consuming the same topic. In the example you gave, if your topic has only one partition, the maximum parallelism is one, and only one consumer in the consumer group will receive messages (all 100 of them). But if you have 4 partitions, you can have up to 4 active consumers, one for each partition, and each receives 25 messages if the records were spread evenly.
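The selection rules and the consumer-group limit described above can be sketched in plain Python. This is only an illustration under stated assumptions: `make_partitioner` and `assign` are hypothetical helpers, `zlib.crc32` stands in for the hash the real client uses (murmur2), and the assignment is a simple round-robin rather than any of Kafka's actual assignor strategies.

```python
import zlib
from collections import Counter
from itertools import count


def make_partitioner(num_partitions):
    """Return a chooser following: explicit partition > key hash > round-robin."""
    round_robin = count()

    def choose(key=None, partition=None):
        if partition is not None:
            if not 0 <= partition < num_partitions:
                raise ValueError("invalid partition number")
            return partition
        if key is not None:
            # Assumption: crc32 is a stand-in for the client's real key hash.
            return zlib.crc32(key) % num_partitions
        return next(round_robin) % num_partitions

    return choose


def assign(partitions, consumers):
    """Give each partition exactly one owner within a consumer group."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


choose = make_partitioner(4)
# 100 records with no key and no partition: spread evenly over 4 partitions.
per_partition = Counter(choose() for _ in range(100))
print(sorted(per_partition.items()))

# Four partitions can keep four consumers busy; one partition only one.
print(assign([0, 1, 2, 3], ["c1", "c2", "c3", "c4"]))
print(assign([0], ["c1", "c2"]))
```

The last call shows why extra consumers do not help a single-partition topic: the second consumer in the group is left with nothing to read.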

AHHH I see! I misread the configuration file. There is a property called `num.partitions` in server.properties, and I thought it meant the total number of partitions this server is allowed to have. After your explanation, I looked at it again and realized it's the number of partitions PER TOPIC. So in short, Kafka does the distribution for me. — mofury, Sep 18 '17 at 23:12
Actually, I have another follow-up question. Each partition can have replicas. Let's say for my partition 0, I have 3 in-sync replicas plus the leader. Does that count as having 4 partitions, so that I can have 4 consumers consume them in parallel? Thank you! — mofury, Sep 18 '17 at 23:40
No. The leader replica handles all read and write requests for the partition. — Lan, Sep 19 '17 at 01:48
