
I have a data flow use case where I want to have topics defined based on each of the customer repositories (which might be in the order of 100,000s) Each data flow would be a topic with partitions (in the order of a few 10s) defining the different stages of the flow.

Is Kafka good for a scenario like this? If not how would I remodel my use case to handle such scenarios. Also it is the case that each customer repository data cannot be mingled with others even during processing.

Giorgos Myrianthous
Swami PR

2 Answers


Update March 2021: With Kafka's new KRaft mode*, which entirely removes ZooKeeper from Kafka's architecture, a Kafka cluster can handle millions of topics/partitions. See https://www.confluent.io/blog/kafka-without-zookeeper-a-sneak-peek/ for details.

*short for "Kafka Raft Metadata mode"; in Early Access as of Kafka v2.8


Update September 2018: As of Kafka v2.0, a Kafka cluster can have hundreds of thousands of topics. See https://blogs.apache.org/kafka/entry/apache-kafka-supports-more-partitions.


Initial answer below for posterity:

The rule of thumb is that the number of Kafka topics can be in the thousands.

Jun Rao (Kafka committer; now at Confluent, formerly on LinkedIn's Kafka team) wrote:

At LinkedIn, our largest cluster has more than 2K topics. 5K topics should be fine. [...]

With more topics, you may hit one of those limits: (1) # dirs allowed in a FS; (2) open file handlers (we keep all log segments open in the broker); (3) ZK nodes.
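The scale in the question (100,000s of topics with a few tens of partitions each) can be checked against those limits with back-of-the-envelope arithmetic. The sketch below is illustrative only; the segments-per-partition and files-per-segment figures are assumptions, not Kafka constants:

```python
# Rough estimate of broker-side open file handles for the asker's
# scenario: ~100,000 topics x ~30 partitions each. Kafka keeps all
# log segments open, so every retained segment costs file handles.

topics = 100_000
partitions_per_topic = 30
segments_per_partition = 5   # assumption: active segment plus a few retained ones
files_per_segment = 2        # assumption: a .log file plus a .index file

total_partitions = topics * partitions_per_topic
open_files = total_partitions * segments_per_partition * files_per_segment

print(f"total partitions:     {total_partitions:,}")   # 3,000,000
print(f"estimated open files: {open_files:,}")         # 30,000,000
```

Even with generous OS limits, tens of millions of open file handles on a single cluster is far beyond what a pre-2.0 deployment could sustain, which is why the answer below points toward fewer, larger topics.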

The Kafka FAQ gives the following abstract guideline:

Kafka FAQ: How many topics can I have?

Unlike many messaging systems, Kafka topics are meant to scale up arbitrarily. Hence we encourage fewer large topics rather than many small topics. So, for example, if we were storing notifications for users, we would encourage a design with a single notifications topic partitioned by user id rather than a separate topic per user.

The actual scalability is for the most part determined by the number of total partitions across all topics, not the number of topics itself (see the question below for details).
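The FAQ's advice works because a keyed producer always routes records with the same key to the same partition, so per-user ordering survives without a per-user topic. A minimal sketch of that idea (Kafka's real default partitioner hashes the key bytes with murmur2; plain `zlib.crc32` is used here purely for illustration, and the partition count is an assumption):

```python
import zlib

NUM_PARTITIONS = 30  # assumption: partition count of the single "notifications" topic

def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a record key to a partition, the way a keyed producer would."""
    return zlib.crc32(user_id.encode("utf-8")) % num_partitions

# Every record for a given user hashes to one fixed partition,
# so that user's records stay ordered relative to each other.
assert partition_for("customer-42") == partition_for("customer-42")

# Different users may or may not share a partition; both outcomes are fine,
# because consumers only rely on ordering *within* a key, not across keys.
print(partition_for("customer-1"), partition_for("customer-2"))
```

The same pattern applies to the asker's repositories: one topic per flow stage, keyed by repository id, instead of a topic per repository.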

The article http://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/ (written by the aforementioned Jun Rao) adds further details, and particularly focuses on the impact of the number of partitions.

IMHO your use case / model is a bit of a stretch for a single Kafka cluster, though not necessarily for Kafka in general. With the little information you shared (I understand that a public forum is not the best place for sensitive discussions :-P) the only off-the-hip comment I can provide you with is to consider using more than one Kafka cluster because you mentioned that customer data must be very much isolated anyways (including the processing steps).
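One way to make the multiple-clusters suggestion concrete is a thin routing layer that pins each customer repository to exactly one cluster, so its data never shares brokers with another customer's. Everything below (cluster names, bootstrap addresses, the `cluster_for_repo` helper) is hypothetical scaffolding, not a Kafka API:

```python
# Hypothetical registry: each customer repository is assigned to exactly
# one Kafka cluster, keeping its data physically isolated end to end.
CLUSTERS = {
    "cluster-a": "kafka-a1:9092,kafka-a2:9092",
    "cluster-b": "kafka-b1:9092,kafka-b2:9092",
}

REPO_ASSIGNMENTS = {
    "repo-001": "cluster-a",
    "repo-002": "cluster-b",
}

def cluster_for_repo(repo_id: str) -> str:
    """Return the bootstrap servers of the cluster that owns this repository."""
    return CLUSTERS[REPO_ASSIGNMENTS[repo_id]]

# A producer or consumer for repo-001 would connect only to cluster-a:
print(cluster_for_repo("repo-001"))  # kafka-a1:9092,kafka-a2:9092
```

The assignment table is where the capacity policy from the comments below could live, e.g. opening a new cluster once existing ones approach a chosen topic budget.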

I hope this helps a bit!

miguno
  • Thanks @miguno. I will take the suggestion of using multiple Kafka clusters. My followup question will then be. Would it be a good idea to spawn new Kafka clusters once I reach say a constant 2000 topics? Using like Mesos to host the Kafka clusters? – Swami PR Oct 19 '15 at 20:07
  • I think yes, it is a good idea to start thinking about how you could easily manage multiple Kafka clusters -- not only with regards to deploying (spawning) but also with regards to monitoring (e.g. to determine when spawning new clusters actually makes sense in your situation). – miguno Oct 21 '15 at 13:33
  • Are you able to provide some clarity on the update? "Update Sep 2018: Today, as of Kafka v2.0, a Kafka cluster can have hundreds of thousands of topics." - I could not find anything about this in the documentation – GeorgeWilson Jun 14 '19 at 01:30
  • Ah, good point. I updated the answer. For a direct response, see https://blogs.apache.org/kafka/entry/apache-kafka-supports-more-partitions. – miguno Jun 14 '19 at 07:18
  • Thanks @MichaelG.Noll for this great explanation. I read thoroughly the post on Confluent's blog about Kafka v2.8 (btw, great release!). Back in 2015, you suggested creating 1 topic partitioned by userId rather than a separate topic per user. Is it still valid with Kafka v2.8? Today we can create more topics with good performance, but I read that performance with a high number of partitions has also drastically increased! In 2022, what would be your choice for the author's use case? – Julien Elkaim Feb 28 '22 at 11:43
  • I am also interested in this topic, because Kafka gives you persistence and transient consumption all in one box. – iam thadiyan May 14 '22 at 13:45
  • With today's Kafka, which supports millions of partitions/topics, you can actually experiment with the scenario of using one-topic-per-user. – miguno May 26 '22 at 09:10

Consider that Kafka is a compelling choice within the network, but it was not designed to effectively and efficiently (though quickly) distribute data to hundreds of thousands of consumers over the last mile -- that is, across sometimes congested and unreliable web, mobile, and satellite networks. Inserting, or alternatively using, a real-time API management platform allows the same data aggregation but is also purpose-built to maximize efficient, selective, and highly scalable data distribution outside the corporate network. A real-time API management solution handles the challenges of these networks and manages the hundreds of thousands of discrete topics required with ease and without massive amounts of added infrastructure.