
From the Kafka FAQ page:

In the Kafka producer, a partition key can be specified to indicate the destination partition of the message. By default, a hashing-based partitioner is used to determine the partition ID given the key.

So all the messages with a particular key will always go to the same partition in a topic:

  1. How does the consumer know which partition the producer wrote to, so it can consume directly from that partition?
  2. If there are more producers than partitions, and multiple producers are writing to the same partition, how are the offsets ordered so that consumers can consume messages from specific producers?
Gadam

2 Answers


How does the consumer know which partition the producer wrote to

It doesn't need to, and arguably shouldn't, as that would create tight coupling between clients. All consumer instances should be able to handle any message in the subscribed topic. While you can assign a Consumer to a list of TopicPartition instances, and you can call the methods of the DefaultPartitioner for a given key to find out which partition it would have gone to, I've personally never run across a need for that. Also keep in mind that producers have full control over the partitioner.class setting and do not need to inform consumers about it.

If there are more producers than partitions, and multiple producers are writing to the same partition, how are the offsets ordered...

The number of producers or partitions doesn't matter. Batches are written to each partition sequentially, in arrival order. You can limit the number of batches sent at once per producer client (and you only need one instance per application) with max.in.flight.requests.per.connection, but across separate applications you of course cannot control any ordering.
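As a sketch of how this works broker-side (a simplified model for illustration, not Kafka's actual code): each partition is an append-only log, and the broker assigns the next offset to whichever batch arrives first, regardless of which producer sent it.

```python
# Simplified model of a broker assigning offsets within one partition.
# Illustration only, not Kafka's actual implementation.

class Partition:
    def __init__(self):
        self.log = []  # list of (offset, producer_id, value)

    def append(self, producer_id, batch):
        """Append a batch; offsets are assigned sequentially on arrival."""
        for value in batch:
            self.log.append((len(self.log), producer_id, value))

p = Partition()
# Two producers interleave writes to the same partition; offsets
# reflect arrival order, not which producer sent the message.
p.append("producer-A", ["a1", "a2"])
p.append("producer-B", ["b1"])
p.append("producer-A", ["a3"])

offsets = [offset for offset, _, _ in p.log]
print(offsets)  # [0, 1, 2, 3] — strictly increasing, producers interleaved
```

The point of the sketch: offsets are a property of the partition log, not of any producer, so "messages from a specific producer" simply do not exist as a concept at the offset level.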

so that the consumers can consume messages from specific producers?

Again, this should not be done.

OneCricketeer
  • "Again, this should not be done." - so this basically means the consumer should go through all the messages in the topic, even from other producers it doesn't care about before it finds what it needs? – Gadam Jul 28 '21 at 19:36
  • Consumers are very fast; in our deployment we can go over a whole topic with thousands or millions of records in a matter of seconds and filter based on a condition – Ran Lupovich Jul 28 '21 at 20:37
  • @Gadam The assumption is that if there is a topic being produced to, all consumers ought to be able to read that, yes. Like I said, you can assign consumers to partitions, but not specific records that the producers have written... This pattern is very similar to any RDBMS - you have clients that write rows, then you can query all records, or select certain columns, but it doesn't matter who writes the data – OneCricketeer Jul 29 '21 at 14:35
  • 1
    In RDBMS you can have index only access and in kafka you are ought to do full table scan, what's your thought about that? – Ran Lupovich Jul 29 '21 at 22:14

Kafka is a distributed event streaming platform, and one of its use cases is decoupling services: the producer (one application) writes messages to topics, and consumers (another application) read from those topics.

If you have more than one producer, the order of data in a Kafka topic partition is not guaranteed across producers; it is simply the order in which the messages were written to the partition. (Even with one producer there can be ordering issues; read about the idempotent producer.)

The offset is assigned atomically, which guarantees that no two messages get the same offset.

The offset is a running number; it has meaning only within a specific topic and a specific partition.

If you are using the default partitioner, the murmur2 algorithm decides which partition each message goes to. When a record containing a key is sent to Kafka, the partitioner in the producer runs the hash function over the key, and the resulting value determines the partition number that key is sent to. Since it is the same murmur2 function everywhere, the same key will keep mapping to the same partition even across different producers.
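To make this concrete, here is a Python sketch of the key-to-partition mapping, following the same scheme as the default partitioner (32-bit murmur2 over the key bytes, masked to a positive value, modulo the partition count). This is an illustrative reimplementation, not the Java client's code, so treat it as a sketch rather than a byte-for-byte reference:

```python
def murmur2(data: bytes) -> int:
    """32-bit murmur2 hash (sketch of the variant Kafka's client uses)."""
    length = len(data)
    seed, m, r = 0x9747B28C, 0x5BD1E995, 24
    h = (seed ^ length) & 0xFFFFFFFF
    i = 0
    while length - i >= 4:
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * m) & 0xFFFFFFFF
        k ^= k >> r
        k = (k * m) & 0xFFFFFFFF
        h = (h * m) & 0xFFFFFFFF
        h ^= k
        i += 4
    left = length - i  # handle the 1-3 trailing bytes
    if left >= 3:
        h ^= data[i + 2] << 16
    if left >= 2:
        h ^= data[i + 1] << 8
    if left >= 1:
        h ^= data[i]
        h = (h * m) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * m) & 0xFFFFFFFF
    h ^= h >> 15
    return h

def partition_for(key: bytes, num_partitions: int) -> int:
    # Mask to a non-negative value, then take modulo the partition count,
    # mirroring toPositive(murmur2(keyBytes)) % numPartitions.
    return (murmur2(key) & 0x7FFFFFFF) % num_partitions

# Any producer hashing the same key gets the same partition:
print(partition_for(b"user-42", 6) == partition_for(b"user-42", 6))  # True
```

Because the mapping is a pure function of the key bytes and the partition count, two independent producer processes will agree on the destination partition without any coordination.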

A consumer is assigned to (or subscribes to) topic partitions; it does not know which keys were sent to which partition. Within a consumer group, an assignor function decides which consumer handles which partitions.

Ran Lupovich
  • then what is the purpose of the producer controlling which partition the message needs to go to? (using default or custom partitioner) – Gadam Jul 28 '21 at 19:37
  • https://newrelic.com/blog/best-practices/effective-strategies-kafka-topic-partitioning – Ran Lupovich Jul 28 '21 at 20:49
  • If you have so much load that you need more than a single instance of your application, you need to partition your data. How you partition serves as your load balancing for the downstream application. The producer clients decide which topic partition that the data ends up in, but it’s what the consumer applications do with that data that drives the decision logic. If possible, the best partitioning strategy to use is uncorrelated/random. However, you may need to partition on an attribute of the data – Ran Lupovich Jul 28 '21 at 20:49
  • The benefit is that you will handle in one consumer process all the related information/records that belongs to a certain key – Ran Lupovich Jul 28 '21 at 20:51
  • Another thought: when thinking about multiple producers, the picture to have in mind is IoT, where each producer is responsible for its own "key" and information. In any case, with multiple producers, if order matters, it's best to let each producer handle a different set of keys – Ran Lupovich Jul 29 '21 at 06:29