I am researching the optimal number of partitions for a Kafka cluster, looking at scenarios with different numbers of brokers but within one specific context.
Consider this particular use case, where we have:
- Round-robin partitioning (not keyed)
- We don't plan to scale the cluster (the number of brokers won't change, neither up nor down)
- The throughput of the stream won't change, on either the producing or the consuming side
- The number of consumers won't change
Then, in this context, to determine the optimal number of partitions for a variable number of brokers, I think it boils down to just two factors: the throughput and the number of brokers.
In this question they mention the relation between throughput and partition number. So let's say that we do our calculation and it gives us X partitions to match our throughput needs.
But considering we have Y brokers, the number of partitions needed to ensure that all brokers are used should be MAX(X, Y), shouldn't it?
Because if our throughput needs call for just 6 partitions (X=6) but our cluster has 10 brokers, then we should at least use all brokers and set 10 partitions, right? Otherwise, with 6 partitions on a 10-broker cluster, there would be 4 brokers doing nothing except adding to the bill.
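To make the arithmetic concrete, here is a minimal sketch of the rule I have in mind. The function names and the throughput figures (60 MB/s target, 10 MB/s per partition) are made up for illustration, and the simple division used to get X is only my assumption about how the throughput calculation would look; nothing here comes from Kafka itself.

```python
import math


def partitions_for_throughput(target_mb_s, per_partition_mb_s):
    """X: partitions needed so that per-partition throughput covers the target.

    Hypothetical helper; the division is an assumed rule of thumb.
    """
    return math.ceil(target_mb_s / per_partition_mb_s)


def proposed_partition_count(x_throughput, y_brokers):
    """The rule from the question: MAX(X, Y)."""
    return max(x_throughput, y_brokers)


# Illustrative numbers only: they are chosen to give X = 6 as in the example.
x = partitions_for_throughput(target_mb_s=60, per_partition_mb_s=10)
y = 10  # brokers in the cluster

print(x, proposed_partition_count(x, y))
```

With these numbers X is 6 and the rule picks 10, which is exactly the case I am unsure about.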
That's my understanding so far but I think I might be oversimplifying it.
Also, does the replication factor play a role in deciding the number of partitions, or is it completely independent? My guess is that it's independent, since it just copies each partition's data to other brokers. But I may be oversimplifying that as well.
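To check my guess about the replication factor, I tried the toy model below. It assumes replicas are placed on consecutive brokers in plain round-robin order, which is my own simplification and probably not how Kafka really assigns replicas; the function name is hypothetical too.

```python
def replicas_per_broker(num_partitions, num_brokers, replication_factor):
    """Count replicas per broker under a naive round-robin placement.

    Assumption: replica r of partition p lands on broker (p + r) % num_brokers.
    This is a simplified model, not Kafka's actual assignment logic.
    """
    counts = {broker: 0 for broker in range(num_brokers)}
    for p in range(num_partitions):
        for r in range(replication_factor):
            counts[(p + r) % num_brokers] += 1
    return counts


# Same scenario as above: 6 partitions on 10 brokers, replication factor 3.
placement = replicas_per_broker(num_partitions=6, num_brokers=10, replication_factor=3)
idle = [broker for broker, n in placement.items() if n == 0]

print(placement)
print("brokers holding no replica:", idle)
```

In this toy model the set of brokers that hold no data at all changes with the replication factor, which is why I am not sure the two settings are really independent.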