8

There is a very small but very powerful detail in the Kafka org.apache.kafka.clients.producer.internals.DefaultPartitioner implementation that bugs me a lot.

It is this line of code:

return DefaultPartitioner.toPositive(Utils.murmur2(keyBytes)) % numPartitions;

To be more precise, the trailing % numPartitions. I keep asking myself: what is the reason for introducing such a strong constraint by making the partition ID a function of the number of existing partitions? Just for the convenience of having small (human-readable/traceable?) numbers compared to the total number of partitions? Does anyone here have broader insight into the issue?

I'm asking this because in our implementation, the key we use to store data in Kafka is domain-sensitive and we use it to retrieve information from Kafka based on it. For instance, we have consumers that need to subscribe ONLY to the partitions that are of interest to them, and the way we make that link is through such keys.
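
Roughly, the consumption side looks like the sketch below (this is only an illustration; the "orders" topic, the bootstrap address, and the keyToPartition helper are placeholders, not our actual code):

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    // Consume ONLY the partition that a given domain key maps to.
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
    int partitionForKey = keyToPartition("customer-42");  // hypothetical key -> partition mapping
    consumer.assign(Collections.singletonList(new TopicPartition("orders", partitionForKey)));
    ConsumerRecords<String, String> records = consumer.poll(100);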

Would it be safe to use a custom partitioner that doesn't do that modulo operation? Should we expect any performance degradation? Does this have any implications on the Producer and/or Consumer side?

Any ideas and comments are welcome.

Suh Fangmbeng
nucatus

3 Answers

20

Partitions in a Kafka topic are numbered 0...N-1 (for a topic with N partitions). Thus, if a key is hashed to determine a partition, the resulting hash value must fall in the interval [0, N-1] -- it must be a valid partition number.

Using a modulo operation is a standard technique in hashing.
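
To make that concrete, here is a minimal sketch of the mapping (the partition count and key are made-up values, String.hashCode() stands in for Utils.murmur2, and the bitmask plays the role of DefaultPartitioner.toPositive by dropping the sign bit):

    int numPartitions = 6;                                // example topic size
    int hash = "some-domain-key".hashCode();              // stand-in for murmur2(keyBytes)
    int partition = (hash & 0x7fffffff) % numPartitions;  // always in [0, numPartitions - 1]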

Matthias J. Sax
  • Well, I see this as a constraint that shouldn't have been exposed outside of kafka. One should be able to create a partition with the ID = the hash (w/o the modulo), so that a one-to-one relationship between the key and the partition could be created, no matter what the total number of partitions is. Internally, kafka could maintain a counter and map the key hash to the incremented value of the counter for every new key, if the 0...N sequence was so critical to the internal logic. – nucatus Oct 01 '16 at 14:02
  • Well. It's called "partitioner" and the method JavaDoc says "Compute the partition for the given record." https://kafka.apache.org/090/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html – Matthias J. Sax Oct 01 '16 at 15:02
  • I would say the documentation is misleading. The correct one should be "Computes the partition for the given record and for the given number of partitions." If any of these changes, the relationship doesn't hold. – nucatus Oct 01 '16 at 16:43
  • `Internally, kafka could maintain a counter and map the key hash to the incremented value of the counter for every new key` NO. A 4-byte key could have 4 billion distinct values, @nucatus you can not keep a map of every value to some incremented result. With larger keys, the result would be worse, you would have to transmit this map to every consumer (and its additions) before a consumer could consume. All the keys ever seen could easily be GB's of data. – Scott Carey Dec 27 '19 at 19:23
  • Note that it is the _client_ that decides which partition a message is sent to. You can use one of the partitioners. Or explicitly send any given message to any partition. A partition could represent a priority, a customer, a tenant... The hash/modulo approach simply spreads the messages across available partitions. If the keys are somewhat distributed, then this is a decent approach for load balancing. Choose whichever partitioning style is appropriate for your message flow and consumption. – AndyPook Jan 31 '20 at 15:28
4

Normally you apply a modulo to the hash to make sure that the entry fits within the hash range.

Say you have a hash range of 5 buckets.

 -------------------                                                                                   
| 0 | 1 | 2 | 3 | 4 |                                                                                  
 -------------------  

If the hashcode of your entry happens to be 6, you have to take it modulo the number of available
buckets so that it fits in the range, which means bucket 1 in this case.
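
In code form, with the numbers from the example above:

    int numBuckets = 5;
    int hashCode = 6;
    int bucket = hashCode % numBuckets;  // 6 % 5 == 1, so the entry lands in bucket 1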

An even more important case is when you decide to add or remove a bucket from the range.
Say you decrease the size of the hash map to 4 buckets; then the last bucket becomes inactive and
you have to rehash the values in bucket #4 to the next bucket in the clockwise direction (I'm talking
about consistent hashing here).

Also, newly arriving hashes need to be distributed across the 4 active buckets, because the 5th one goes away; this is taken care of by the modulo.

The same concept is used in distributed systems for the rehashing that happens when you add or remove a node from your cluster.

The Kafka DefaultPartitioner uses the modulo for the same purpose. Adding or removing partitions is a very usual case, if you ask me: for example, during a high volume of incoming messages I might want to add more partitions so that I get high write throughput as well as high read throughput, since I can consume the partitions in parallel.

You can override the partitioning algorithm based on your business logic by choosing a key in your messages that makes sure the messages are distributed uniformly within the range [0...N-1].
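
If you go the custom-partitioner route, a minimal sketch could look like the one below. Only the Partitioner interface and the partitioner.class producer config are Kafka's own; the class name and the "numeric key prefix" rule are invented for illustration:

    import java.util.Map;

    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;

    // Sketch: derive the partition from a domain-specific, numeric key prefix.
    // The modulo is kept so the result is always a valid partition number.
    public class DomainKeyPartitioner implements Partitioner {

        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            int numPartitions = cluster.partitionsForTopic(topic).size();
            // Hypothetical key format: "<domainId>-<rest>"
            int domainId = Integer.parseInt(((String) key).split("-")[0]);
            return domainId % numPartitions;
        }

        @Override
        public void close() { }

        @Override
        public void configure(Map<String, ?> configs) { }
    }

The producer then picks it up via props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, DomainKeyPartitioner.class.getName()).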

prayagupa
1

The performance impact of using a custom partitioner entirely depends on your implementation of it.

I'm not entirely sure what you're trying to accomplish though. If I understand your question correctly, you want to use the value of the message key as the partition number directly, without doing any modulo operation on it to determine a partition?

In that case, all you need to do is use the overloaded ProducerRecord(java.lang.String topic, java.lang.Integer partition, K key, V value) constructor when producing a message to a Kafka topic, passing in the desired partition number. This way, the default partitioning logic is bypassed entirely and the message goes to the specified partition.
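
For example, a minimal sketch (the topic name, partition number, and serializer choices are placeholders):

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Send a record to an explicitly chosen partition, bypassing the partitioner.
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    KafkaProducer<String, String> producer = new KafkaProducer<>(props);
    int targetPartition = 3;  // chosen by your own domain logic
    producer.send(new ProducerRecord<>("my-topic", targetPartition, "domain-key", "payload"));
    producer.close();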

etiescu
  • Well, this is another route I was considering, but in this case we need to maintain the key -> partitionId mapping stored somewhere else (zookeeper?) so that when another message with the same key comes in, it is sent to the same partition. We tried to avoid this path, since for every produced message you need to make a trip to this map. – nucatus Oct 01 '16 at 13:56
  • I see. Unfortunately, no matter what approach you take, if you need to map messageKeys to arbitrary partitions and you can't obtain the partition number dynamically through some sort of arithmetic operation on the messageKey, you're going to add overhead to the system, whether or not this logic is captured in a custom partitioner or done elsewhere. – etiescu Oct 01 '16 at 14:40