
I want to write keyed (key/value) records to Kafka so that the data is read back in the same order it was written. My question is: should the number of partitions in the topic equal the number of distinct keys in the incoming data? I already know that, with keyed records, data having the same key goes to the same partition.

Hence, if the number of partitions is not equal to the number of distinct keys in the data, can records with different keys end up in the same partition? In that case, how is the data order kept?

scalacode

2 Answers


From Kafka docs:

Each partition is an ordered, immutable sequence of records that is continually appended to a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.


Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.


A consumer instance sees records in the order they are stored in the log.

These are basic rules of Kafka, and sending messages with different keys to the same partition does not change them. You could even send all messages to a single partition: the first message would still be appended to the log before the subsequent ones and would have a lower offset. Therefore, the order within that partition is preserved.
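
To make the offset rule concrete, here is a minimal sketch of a single partition as an append-only list. It is a toy model, not the Kafka client API: `PartitionLog`, `append`, and `read_from` are names invented for this illustration.

```python
class PartitionLog:
    """Toy append-only log standing in for one Kafka partition (illustration only)."""

    def __init__(self):
        self._records = []

    def append(self, key, value):
        """Append a record and return its offset, i.e. its position in the log."""
        offset = len(self._records)
        self._records.append((key, value))
        return offset

    def read_from(self, offset=0):
        """Yield (offset, key, value) in log order, the way a consumer would see them."""
        for i in range(offset, len(self._records)):
            key, value = self._records[i]
            yield i, key, value


log = PartitionLog()

# Records with different keys land in the same partition;
# offsets still grow in the order the records were sent.
sent = [("A", "V1"), ("B", "V1"), ("A", "V2")]
offsets = [log.append(key, value) for key, value in sent]

# Reading from offset 0 returns the records in exactly the send order.
consumed = [(key, value) for _, key, value in log.read_from()]

print(offsets)
print(consumed == sent)
```

The key plays no role in the ordering here: whatever is appended first gets the lower offset, regardless of which key it carries.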

H.Ç.T

My question is: should the number of partitions in the topic equal the number of distinct keys in the incoming data?

I don't think this is generally a good idea. It depends entirely on the data you are processing. If you have a fixed number of keys (such as female, male and diverse), it might make sense. However, even then you need to be careful, because this could lead to an unbalanced data load across the brokers, as there might be far fewer "diverse" records. You could end up with most of the data in one partition while the other partition(s) are left (almost) empty. In general, the number of partitions should be chosen according to your throughput requirements.
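
A rough sketch of how such a skew can be measured. The key frequencies are made up for illustration, and `partition_for` uses CRC32 as a stand-in hash; as far as I know Kafka's default partitioner hashes the serialized key with murmur2, so the actual key-to-partition mapping will differ from this one.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a deterministic stand-in hash (CRC32, not murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def load_per_partition(keys, num_partitions):
    """Count how many records each partition would receive."""
    counts = {p: 0 for p in range(num_partitions)}
    for key in keys:
        counts[partition_for(key, num_partitions)] += 1
    return counts


# Hypothetical skewed stream: very few "diverse" records.
keys = ["female"] * 480 + ["male"] * 500 + ["diverse"] * 20

load = load_per_partition(keys, num_partitions=3)
for partition, count in sorted(load.items()):
    print(f"partition{partition}: {count} records")
```

Note that matching the partition count to the key count does not by itself give each key its own partition: with a hash-based mapping, two keys can collide on one partition and leave another one empty.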

Hence, if the number of partitions is not equal to the number of distinct keys in the data, can records with different keys end up in the same partition? In that case, how is the data order kept?

Yes, you could end up with different keys in the same partition. The ordering is then kept within each partition, but it is not guaranteed across the topic as a whole. For example, assume you have the keys A, B, and C and a topic with two partitions. A and C go to the first partition and B is stored in the second partition. If data is flowing like this: A/V1, A/V2, B/V1, C/V1, B/V2

Then your partitions will be filled like this:

  • partition0: A/V1, A/V2, C/V1
  • partition1: B/V1, B/V2

When consuming this topic, the order of the A and C messages relative to the B messages is not defined. However, it is always guaranteed that the message A/V1 is consumed before A/V2, A/V2 before C/V1, and B/V1 before B/V2.
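
The example can be replayed with a small simulation. This is only a sketch: the key-to-partition assignment is hard-coded to match the example (A and C in partition 0, B in partition 1), and `route` and `values_for_key` are helper names made up here.

```python
# Assignment taken from the example: an assumption of the example,
# not necessarily what Kafka's key hashing would produce.
ASSIGNMENT = {"A": 0, "C": 0, "B": 1}


def route(records, assignment, num_partitions):
    """Append each (key, value) record to the partition assigned to its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[assignment[key]].append((key, value))
    return partitions


def values_for_key(partition, key):
    """Return the values of one key in the order they appear in a partition."""
    return [value for k, value in partition if k == key]


stream = [("A", "V1"), ("A", "V2"), ("B", "V1"), ("C", "V1"), ("B", "V2")]
partitions = route(stream, ASSIGNMENT, num_partitions=2)

for index, partition in enumerate(partitions):
    print(f"partition{index}: " + ", ".join(f"{k}/{v}" for k, v in partition))
# partition0: A/V1, A/V2, C/V1
# partition1: B/V1, B/V2
```

Within each list the send order survives, so per-key order is intact. What is lost is the interleaving between the two lists: nothing in the partitions records that B/V1 was sent before C/V1.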

If you are looking for a more flexible way of directing your messages to partitions, you can also consider writing a custom partitioner.
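
As a rough idea of what a custom partitioner could look like, the sketch below pins selected keys to dedicated partitions and hashes all other keys over the remaining ones. The call shape `(key_bytes, all_partitions, available_partitions)` mirrors what I believe the `partitioner` option of kafka-python's `KafkaProducer` expects; treat that as an assumption and check your client's documentation (the Java client uses a `Partitioner` interface instead). `PINNED` and its contents are hypothetical, and keys are assumed to be non-null bytes.

```python
import zlib

# Hypothetical pinning: key "A" gets partition 0 to itself.
PINNED = {b"A": 0}


def custom_partitioner(key_bytes, all_partitions, available_partitions):
    """Pick a partition for a serialized, non-null key.

    Pinned keys go to their reserved partition; every other key is hashed
    (CRC32 as a stand-in hash) over the partitions that are not reserved.
    """
    if key_bytes in PINNED:
        return PINNED[key_bytes]

    reserved = set(PINNED.values())
    shared = [p for p in all_partitions if p not in reserved]
    if not shared:
        # Every partition is reserved: fall back to hashing over all of them.
        shared = list(all_partitions)
    return shared[zlib.crc32(key_bytes) % len(shared)]


all_partitions = [0, 1, 2]
for key in (b"A", b"B", b"C"):
    print(key, "->", custom_partitioner(key, all_partitions, all_partitions))
```

Whatever rule you choose, it has to be deterministic: per-key ordering only holds if the same key is always sent to the same partition.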

Michael Heil