
If I have an application that publishes events on a Kafka topic and my consumers need to read the data in the order they were published, then my topic can have only one partition, since Kafka guarantees ordering only within a partition.

However, I read that Kafka uses partitioning to provide scalability, i.e. by placing the partitions of a topic on several brokers. I also read that a partition itself cannot be split.

Since ordering is only possible within a partition, is scalability a problem for my application? Is there a way to deal with this problem or is my understanding of Kafka not right?

Imagine my application has thousands of consumers (each in a single group so everyone consumes the published events). All need to read data from that single topic with that single partition.

EDIT: Another thing that comes to mind: imagine having 5 partitions of that topic, and all consumers must still read the events in the right order. If the publishers don't specify a partition id or a key, then Kafka will publish the events round-robin across the 5 partitions, right?

If all consumers are in a single group and all subscribe to the topic, then each consumer reads events of all topics, which means that they would still get the ordered messages, right?

L.Gashi
  • If records have no keys, yes they are written in round robin, by default, but they are then not ordered in any consistent way for a consumer to read – OneCricketeer Feb 27 '22 at 14:15
  • @L.Gashi can you please approve my answer if you are satisfied with it and if you do think that's the useful information to you. – Nirav Chhatrola Feb 27 '22 at 15:21
  • @OneCricketeer do you mean that (in the case of round-robin) events would, e.g., be published on partitions 1->2->3, but one consumer may read 3->1->2 while others may read 1->2->3 or 3->2->1? – L.Gashi Feb 27 '22 at 17:02
  • That's correct. Also, I don't think there's any guarantee the producer will write in sequential order – OneCricketeer Feb 28 '22 at 12:02

1 Answer


Point 1) If your requirement is to process all records strictly in sequence, then that is not possible with parallel processing, since parallel processing never guarantees the sequence.

Point 2) Yes, in Kafka the sequence is only guaranteed for records sent with the same key, because they are written to the same partition. So analyse your data to see whether it can be segregated into groups of related records that truly require sequential processing. Send each group of related records with the same key, and send other groups of related records with a different key.

Point 3) If you are able to segregate your data by key, then you can increase the number of partitions, and the number of consumers accordingly. For example, if you have 3 partitions, you can scale your application to 3 consumers (note that you are producing records with keys to preserve your sequencing). Each of the 3 consumers is assigned 1 partition, and parallel processing is achieved. (This only guarantees in-sequence processing of records with the same key.)
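To illustrate Points 2 and 3, here is a minimal in-memory sketch in Python. It is a simulation, not Kafka client code: `partition_for`, `publish` and `events_for` are hypothetical helpers, the `order-N` keys are made up, and CRC32 is only a stand-in for the hash the real default partitioner applies to the key. What it shows is that records sharing a key always land in the same partition log, so their relative order survives even with 3 partitions.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # assumed partition count, matching the example in Point 3


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in for a key-hashing partitioner: same key -> same partition.
    # (CRC32 is an assumption for this sketch, not Kafka's actual hash.)
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def publish(records):
    """Append (key, value) records to in-memory partition logs, in publish order."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition_for(key)].append((key, value))
    return partitions


# Hypothetical events: three orders whose updates are interleaved by the publisher.
records = [
    ("order-1", "created"),
    ("order-2", "created"),
    ("order-1", "paid"),
    ("order-3", "created"),
    ("order-2", "cancelled"),
    ("order-1", "shipped"),
]
partitions = publish(records)


def events_for(key: str):
    """Read back the events of one key from the single partition that holds it."""
    log = partitions[partition_for(key)]
    return [value for k, value in log if k == key]


print(events_for("order-1"))  # ['created', 'paid', 'shipped']
```

One consumer per partition can then process its log independently, and each order's events are still seen in publish order. The order between different keys is not preserved.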

Point 4)

Imagine my application has thousands of consumers (each in a single group so everyone consumes the published events). All need to read data from that single topic with that single partition.

If all of your (thousands of) consumers are in the same group and read from a single-partition topic, then only one consumer will be assigned that partition, and all the rest (thousands - 1) will sit idle doing nothing.

If you assign a different group to each consumer, then every consumer will be assigned that single partition and each consumer will process all records individually, so the processing is duplicated across consumers.
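The two cases in Point 4 can be sketched with a toy assignment function. This is a simplification under stated assumptions: `assign` is a hypothetical helper, not one of Kafka's real assignment strategies, and the consumer names are made up. It only captures the rule that, within one group, a partition goes to exactly one consumer.

```python
def assign(partitions, consumers):
    """Give each partition to exactly one consumer of a single group."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


consumers = [f"c{i}" for i in range(1000)]  # hypothetical consumer ids
topic_partitions = [0]                      # a topic with a single partition

# Case 1: all consumers share one group -> one owner, everyone else idle.
same_group = assign(topic_partitions, consumers)
idle = [c for c, owned in same_group.items() if not owned]
print(len(idle))  # 999

# Case 2: every consumer is its own group -> each one owns the partition
# and therefore reads every record, in partition order.
separate_groups = {c: assign(topic_partitions, [c]) for c in consumers}
print(separate_groups["c42"])  # {'c42': [0]}
```

In the second case every consumer sees the full, ordered stream, at the cost of each record being read once per group.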

Point 5)

If all consumers are in a single group and all subscribe to the topic, then each consumer reads events of all topics, which means that they would still get the ordered messages, right?

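No. As described in Point 4), within a single group each partition is assigned to only one consumer, so no consumer reads all partitions. It is therefore not guaranteed that all records will be in order, since they are being processed in parallel by different consumers.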
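A small simulation can show why spreading keyless records over 5 partitions loses the global order even for a reader that sees every partition. It assumes strict round-robin writes, as in the question; real producers may distribute keyless records differently depending on the client version. `publish_round_robin` and `read_in_batches` are hypothetical helpers, and the batch size is an arbitrary choice for the sketch.

```python
NUM_PARTITIONS = 5  # assumed, as in the question's EDIT


def publish_round_robin(events, num_partitions=NUM_PARTITIONS):
    """Write events to partitions 0, 1, 2, ... in turn (assumed strict round-robin)."""
    partitions = [[] for _ in range(num_partitions)]
    for i, event in enumerate(events):
        partitions[i % num_partitions].append(event)
    return partitions


def read_in_batches(partitions, batch_size):
    """Fetch up to batch_size records per partition in turn, like a simple poll loop."""
    offsets = [0] * len(partitions)
    seen = []
    while any(off < len(log) for off, log in zip(offsets, partitions)):
        for idx, log in enumerate(partitions):
            batch = log[offsets[idx]:offsets[idx] + batch_size]
            seen.extend(batch)
            offsets[idx] += len(batch)
    return seen


events = list(range(10))  # published in the order 0, 1, ..., 9
partitions = publish_round_robin(events)
seen = read_in_batches(partitions, batch_size=2)

print(partitions)  # [[0, 5], [1, 6], [2, 7], [3, 8], [4, 9]]
print(seen)        # [0, 5, 1, 6, 2, 7, 3, 8, 4, 9]
```

Each partition's own records still arrive in order (0 before 5, 1 before 6, and so on), but the merged stream no longer matches the publish order, and a different batch size or fetch order would produce yet another interleaving.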

Summary: If you can group the records that actually require sequencing and send them with the same key, then their sequence is guaranteed. If your requirement is to consume all records strictly in sequence, then that is a sequential-processing problem, and parallel processing cannot be achieved here.

Nirav Chhatrola
  • I don't understand why messages would not be ordered. I know that Kafka guarantees ordering only within one partition. However, imagine having a topic with 5 partitions, 3 publishers and 3 subscribers. When the publishers publish events (without a key or id), Kafka would spread the events round-robin across the 5 partitions. Now if every consumer is in a single group, they would all get all 5 partitions assigned. If Kafka decides to publish the events on partition 1, then 2, then 3, etc., do you mean that one consumer might read the order 3->2->1 while another may read 2->1->3 and so on? – L.Gashi Feb 27 '22 at 17:00
  • Let's say you have 1 topic with 5 partitions and 3 consumers in the same group. First of all, the consumers will be assigned partitions, so let's say C1=P1,P4, C2=P2,P5, C3=P3. Once processing starts, all consumers proceed in parallel, so the sequence can't be guaranteed. – Nirav Chhatrola Mar 01 '22 at 03:14