Can a consumer read records from a partition that stores data of particular key value?

Question

Instead of creating many topics, I'm creating a partition for each consumer and storing data using a key. Is there a way to make a consumer in a consumer group read from the partition that stores data for a specific key? If so, can you suggest how it can be done using kafka-python (or any other library)?

Maybe https://stackoverflow.com/questions/45940171/how-to-force-a-consumer-to-read-a-specific-partition-in-kafka helps you. — Seyed Morteza Mousavi, Feb 26 '19 at 05:13
Yeah, I checked assign, but it requires the partition number. Is there a way I can assign using the key instead of the partition number? Otherwise the user has to identify the partition number manually and assign it. — user119100, Feb 27 '19 at 06:34

score 0 · Answer 1 · answered Feb 26 '19 at 06:42

Instead of subscribing to a topic and relying on the related consumer group logic, you can use the "assign" logic (provided, for example, by the Kafka consumer Java client). With a subscription, the consumer is part of a consumer group: partitions are assigned to consumers automatically and rebalanced when a consumer joins or leaves. Assign works differently: the consumer asks for a specific partition and does not take part in any consumer group's automatic assignment. This also means that you are in charge of handling the case where a consumer dies: for example, if consumer 1 is assigned partition 1 and crashes at some point, partition 1 won't be reassigned automatically to another consumer. It's up to you to write the logic that restarts the consumer (or starts another one) so that messages from partition 1 keep being read.
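As a rough illustration of the assign approach with kafka-python, the sketch below pins a consumer to one fixed partition. The broker address `localhost:9092`, the topic name `events` and the helper name `consume_partition` are placeholders invented for this example; they do not come from the question.

```python
def consume_partition(topic, partition, bootstrap_servers="localhost:9092"):
    """Yield (offset, key, value) for every record in one fixed partition."""
    # Imported inside the function so the helper can be defined even where
    # kafka-python is not installed; the import runs on first iteration.
    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(
        bootstrap_servers=bootstrap_servers,
        group_id=None,                 # no group: no rebalancing, no offset commits
        auto_offset_reset="earliest",  # start from the oldest available record
    )
    # assign() attaches the consumer to exactly this partition.
    consumer.assign([TopicPartition(topic, partition)])
    try:
        for record in consumer:
            yield record.offset, record.key, record.value
    finally:
        consumer.close()


# Hypothetical usage (needs a reachable broker):
# for offset, key, value in consume_partition("events", 1):
#     print(offset, key, value)
```

Because nothing reassigns the partition if this process dies, whatever supervises it (a service manager, an orchestrator, your own watchdog) has to restart it.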

score 0 · Answer 2 · answered Feb 27 '19 at 10:51

I believe that what you are trying to achieve is not a best practice in the long term.

If I understand correctly, you need to determine which partition the consumer will connect to, based on the key of the message.

I assume the producers are using the "default partitioner".

Technically you might be able to determine the topic partition by reusing in the consumer the same algorithm that is used in the producer. Here is the Java code of the DefaultPartitioner. You could adapt it in Python.

public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
    List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
    int numPartitions = partitions.size();
    if (keyBytes == null) {
        int nextValue = nextValue(topic);
        List<PartitionInfo> availablePartitions = cluster.availablePartitionsForTopic(topic);
        if (availablePartitions.size() > 0) {
            int part = Utils.toPositive(nextValue) % availablePartitions.size();
            return availablePartitions.get(part).partition();
        } else {
            // no partitions are available, give a non-available partition
            return Utils.toPositive(nextValue) % numPartitions;
        }
    } else {
        // hash the keyBytes to choose a partition
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }
}

private int nextValue(String topic) {
    AtomicInteger counter = topicCounterMap.get(topic);
    if (null == counter) {
        counter = new AtomicInteger(ThreadLocalRandom.current().nextInt());
        AtomicInteger currentCounter = topicCounterMap.putIfAbsent(topic, counter);
        if (currentCounter != null) {
            counter = currentCounter;
        }
    }
    return counter.getAndIncrement();
}

The important part for your use case, when a key is set, is:

Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;

And the Utils.murmur2 method:

public static int murmur2(final byte[] data) {
    int length = data.length;
    int seed = 0x9747b28c;
    // 'm' and 'r' are mixing constants generated offline.
    // They're not really 'magic', they just happen to work well.
    final int m = 0x5bd1e995;
    final int r = 24;

    // Initialize the hash to a random value
    int h = seed ^ length;
    int length4 = length / 4;

    for (int i = 0; i < length4; i++) {
        final int i4 = i * 4;
        int k = (data[i4 + 0] & 0xff) + ((data[i4 + 1] & 0xff) << 8) + ((data[i4 + 2] & 0xff) << 16) + ((data[i4 + 3] & 0xff) << 24);
        k *= m;
        k ^= k >>> r;
        k *= m;
        h *= m;
        h ^= k;
    }

    // Handle the last few bytes of the input array
    switch (length % 4) {
        case 3:
            h ^= (data[(length & ~3) + 2] & 0xff) << 16;
        case 2:
            h ^= (data[(length & ~3) + 1] & 0xff) << 8;
        case 1:
            h ^= data[length & ~3] & 0xff;
            h *= m;
    }

    h ^= h >>> 13;
    h *= m;
    h ^= h >>> 15;

    return h;
}
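Since the question is about kafka-python, here is a sketch of how the keyed branch of the partitioner and the `murmur2` method above might be ported to Python. `partition_for_key` is a hypothetical helper name. It assumes that the producer uses the default partitioner and that `key_bytes` is serialized exactly as the producer serializes the key (for example UTF-8 for string keys). Java's 32-bit integer overflow is emulated by masking with `0xFFFFFFFF`.

```python
MASK_32 = 0xFFFFFFFF


def murmur2(data: bytes) -> int:
    """Port of Kafka's Utils.murmur2; returns the hash as an unsigned 32-bit int."""
    length = len(data)
    seed = 0x9747B28C
    m = 0x5BD1E995
    r = 24

    h = (seed ^ length) & MASK_32

    # Mix the input four bytes at a time (little-endian, as in the Java code).
    for i in range(length // 4):
        k = int.from_bytes(data[i * 4:i * 4 + 4], "little")
        k = (k * m) & MASK_32
        k ^= k >> r
        k = (k * m) & MASK_32
        h = (h * m) & MASK_32
        h ^= k

    # Handle the last few bytes; mirrors the fall-through switch in Java.
    tail = length & ~3
    remaining = length % 4
    if remaining == 3:
        h ^= data[tail + 2] << 16
    if remaining >= 2:
        h ^= data[tail + 1] << 8
    if remaining >= 1:
        h ^= data[tail]
        h = (h * m) & MASK_32

    h ^= h >> 13
    h = (h * m) & MASK_32
    h ^= h >> 15
    return h


def partition_for_key(key_bytes: bytes, num_partitions: int) -> int:
    """Equivalent of Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions."""
    return (murmur2(key_bytes) & 0x7FFFFFFF) % num_partitions


print(murmur2(b"abc"))               # 479470107
print(partition_for_key(b"abc", 6))  # 3
```

In a consumer, the partition count could be read with something like `len(consumer.partitions_for_topic(topic))` and the computed partition passed to `assign`. The result is only valid as long as the topic keeps that number of partitions.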

Why do I think it is not the best solution?

If you add a new partition to your topic, the DefaultPartitioner may return a partition id for a key that differs from the one it returned before the partition was added, because the hash is taken modulo the number of partitions. And by default, existing messages are not repartitioned, meaning that you will have messages with the same key on different partitions.

The same behavior occurs on the consumer side. After the number of partitions changes, the consumer would try to consume messages from a different partition. You would miss the messages stored in the partition previously used for this key.
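A small worked example makes the problem concrete. The value 479470107 is the murmur2 hash of the key bytes `abc`; the partition counts and the helper name `partition_for_hash` are arbitrary choices for illustration.

```python
KEY_HASH = 479470107  # murmur2 hash of b"abc"


def partition_for_hash(key_hash: int, num_partitions: int) -> int:
    """Same arithmetic as toPositive(hash) % numPartitions."""
    return (key_hash & 0x7FFFFFFF) % num_partitions


for num_partitions in (3, 4, 6):
    print(num_partitions, partition_for_hash(KEY_HASH, num_partitions))
# 3 0
# 4 3
# 6 3
```

Growing the topic from 3 to 4 partitions moves the key from partition 0 to partition 3, while the records already written stay in partition 0.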
