Instead of creating many topics, I'm creating one partition per consumer and storing data using a key. Is there a way to make a consumer in a consumer group read from the partition that stores the data for a specific key? If so, can you suggest how it can be done using kafka-python (or any other library)?
-
What have you tried so far? – bigbounty Feb 26 '19 at 05:00
-
Have you tried `assign` instead of `subscribe`? – Seyed Morteza Mousavi Feb 26 '19 at 05:09
-
Maybe https://stackoverflow.com/questions/45940171/how-to-force-a-consumer-to-read-a-specific-partition-in-kafka helps you. – Seyed Morteza Mousavi Feb 26 '19 at 05:13
-
Yeah, I checked `assign`, but it requires the partition number. Is there a way I can assign using the key instead of the partition number? Otherwise the user has to manually identify the partition number and assign it. – user119100 Feb 27 '19 at 06:34
2 Answers
Instead of using subscription and the related consumer group logic, you can use the "assign" logic (it's provided by the Kafka consumer Java client, for example). When a consumer subscribes to a topic as part of a consumer group, partitions are automatically assigned to consumers and rebalanced whenever a consumer joins or leaves. Assign works differently: the consumer asks to be assigned a specific partition, and it does not take part in the consumer group's partition assignment or rebalancing. This also means that you are in charge of handling the failure of a consumer: for example, if consumer 1 is assigned partition 1 and at some point crashes, partition 1 won't be reassigned automatically to another consumer. It's up to you to write the logic that restarts the consumer (or starts another one) to read messages from partition 1.
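As a rough illustration of manual assignment with kafka-python, the sketch below assigns a single partition to a consumer and reads from it. The helper name `read_partition`, the topic name, the partition number, and the broker address are all placeholders made up for this example; it assumes kafka-python is installed and a broker is reachable.

```python
def read_partition(topic, partition, bootstrap_servers="localhost:9092"):
    """Yield (offset, key, value) for messages in one manually assigned partition.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    # Imported here so the module still loads where kafka-python is not installed.
    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(
        bootstrap_servers=bootstrap_servers,
        auto_offset_reset="earliest",  # start at the beginning when no offset is stored
        enable_auto_commit=False,      # no group id is set, so there is nothing to commit
        consumer_timeout_ms=5000,      # stop iterating after 5 s without new messages
    )
    try:
        # assign() bypasses group management, so no automatic rebalancing happens.
        consumer.assign([TopicPartition(topic, partition)])
        for message in consumer:
            yield message.offset, message.key, message.value
    finally:
        consumer.close()


# Example call (placeholder topic and partition number):
# for offset, key, value in read_partition("my-topic", 1):
#     print(offset, key, value)
```

If this process dies, nothing else picks up the partition; restarting it (or starting a replacement) is left to whatever supervises the process.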

I believe that what you are trying to achieve is not best practice in the long term.
If I understood correctly, you need to determine which partition the consumer will connect to, based on the key of the message.
I guess the publishers are using the "default partitioner".
Technically, you might be able to determine the topic partition by reusing in the consumer the same algorithm that is used in the producer. Here is the Java code of the `DefaultPartitioner`. You could adapt it in Python.
    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        int numPartitions = partitions.size();
        if (keyBytes == null) {
            int nextValue = nextValue(topic);
            List<PartitionInfo> availablePartitions = cluster.availablePartitionsForTopic(topic);
            if (availablePartitions.size() > 0) {
                int part = Utils.toPositive(nextValue) % availablePartitions.size();
                return availablePartitions.get(part).partition();
            } else {
                // no partitions are available, give a non-available partition
                return Utils.toPositive(nextValue) % numPartitions;
            }
        } else {
            // hash the keyBytes to choose a partition
            return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        }
    }

    private int nextValue(String topic) {
        AtomicInteger counter = topicCounterMap.get(topic);
        if (null == counter) {
            counter = new AtomicInteger(ThreadLocalRandom.current().nextInt());
            AtomicInteger currentCounter = topicCounterMap.putIfAbsent(topic, counter);
            if (currentCounter != null) {
                counter = currentCounter;
            }
        }
        return counter.getAndIncrement();
    }
The important part for your use case, when a key is set, is:
    Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
And the `Utils.murmur2` method:
    public static int murmur2(final byte[] data) {
        int length = data.length;
        int seed = 0x9747b28c;
        // 'm' and 'r' are mixing constants generated offline.
        // They're not really 'magic', they just happen to work well.
        final int m = 0x5bd1e995;
        final int r = 24;
        // Initialize the hash to a random value
        int h = seed ^ length;
        int length4 = length / 4;
        for (int i = 0; i < length4; i++) {
            final int i4 = i * 4;
            int k = (data[i4 + 0] & 0xff) + ((data[i4 + 1] & 0xff) << 8) + ((data[i4 + 2] & 0xff) << 16) + ((data[i4 + 3] & 0xff) << 24);
            k *= m;
            k ^= k >>> r;
            k *= m;
            h *= m;
            h ^= k;
        }
        // Handle the last few bytes of the input array
        switch (length % 4) {
            case 3:
                h ^= (data[(length & ~3) + 2] & 0xff) << 16;
            case 2:
                h ^= (data[(length & ~3) + 1] & 0xff) << 8;
            case 1:
                h ^= data[length & ~3] & 0xff;
                h *= m;
        }
        h ^= h >>> 13;
        h *= m;
        h ^= h >>> 15;
        return h;
    }
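A possible Python adaptation of the two Java snippets above is sketched below. It keeps the hash as an unsigned 32-bit value and masks after every multiplication to imitate Java's `int` overflow. The function name `partition_for_key` and the key `user-42` are made up for the example, and the result only matches the producer if the producer really uses the default partitioner and serializes the key to the same bytes; I have not checked it against a running cluster.

```python
MASK32 = 0xFFFFFFFF  # keep intermediate values within 32 bits, like a Java int


def murmur2(data: bytes) -> int:
    """Port of the Java Utils.murmur2 shown above; returns an unsigned 32-bit hash."""
    length = len(data)
    seed = 0x9747B28C
    m = 0x5BD1E995
    r = 24

    h = (seed ^ length) & MASK32
    for i in range(length // 4):
        # Four bytes combined in little-endian order, as in the Java loop.
        k = int.from_bytes(data[i * 4:i * 4 + 4], "little")
        k = (k * m) & MASK32
        k ^= k >> r
        k = (k * m) & MASK32
        h = (h * m) & MASK32
        h ^= k

    # Handle the last few bytes; the Java switch falls through from case 3 to 1.
    tail = length & ~3
    remainder = length % 4
    if remainder == 3:
        h ^= data[tail + 2] << 16
    if remainder >= 2:
        h ^= data[tail + 1] << 8
    if remainder >= 1:
        h ^= data[tail]
        h = (h * m) & MASK32

    h ^= h >> 13
    h = (h * m) & MASK32
    h ^= h >> 15
    return h


def partition_for_key(key: bytes, num_partitions: int) -> int:
    """Mirror Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions."""
    return (murmur2(key) & 0x7FFFFFFF) % num_partitions


# Hypothetical key and partition count.
print(partition_for_key(b"user-42", 6))
```

The returned number could then be wrapped in a kafka-python `TopicPartition` and passed to `KafkaConsumer.assign`.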
Why do I think it is not the best solution?
If you add a new partition to your topic, the `DefaultPartitioner` may return a partition id that differs from the one it returned before you added the partition. By default, existing messages are not repartitioned, so you will end up with messages for the same key on different partitions.
The same problem occurs on the consumer side. After the number of partitions changes, the consumer would try to consume from a different partition, and you would miss the messages stored on the partition previously used for this key.
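The effect comes straight from the modulo in the partitioner. A small arithmetic sketch, using a made-up hash value rather than a real murmur2 result:

```python
def partition_for_hash(positive_hash: int, num_partitions: int) -> int:
    """Same final step as the partitioner: positive hash modulo partition count."""
    return positive_hash % num_partitions


key_hash = 1234567  # hypothetical positive hash of some key

before = partition_for_hash(key_hash, 3)  # topic with 3 partitions
after = partition_for_hash(key_hash, 4)   # same topic after adding a 4th partition

print(before, after)  # 1 3
```

Messages written for this key before the change stay on partition 1, while a consumer that recomputes the partition afterwards would read partition 3.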
