
I am using Kafka with Spark Streaming (2.2.0). The load on the system is dynamic, and I am trying to understand how to handle auto scaling. There are two aspects of auto scaling:

  1. Auto scale the computing infra
  2. Auto scale the application components to take advantage of the auto scaled infra

Infra auto scaling: There can be various well-defined trigger points for scaling the infra. One of the possible ones in my case would be the latency or delay in processing messages arriving at Kafka. So I can monitor the Kafka cluster, and if message processing is delayed by more than a certain factor, then I know that more computing power needs to be thrown in.
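For illustration, here is a minimal sketch of such a lag check using the plain Kafka consumer API. The broker address, topic, group id and the threshold logic are all placeholders, not part of my actual setup:

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

object LagCheck extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "broker1:9092")                      // placeholder broker
  props.put("group.id", "streaming-job-group")                        // the Spark job's consumer group
  props.put("key.deserializer", classOf[StringDeserializer].getName)
  props.put("value.deserializer", classOf[StringDeserializer].getName)

  val consumer = new KafkaConsumer[String, String](props)
  val partitions = consumer.partitionsFor("my-topic").asScala
    .map(p => new TopicPartition(p.topic, p.partition))

  // Lag per partition = log-end offset minus the group's committed offset.
  val endOffsets = consumer.endOffsets(partitions.asJava).asScala
  val totalLag = partitions.map { tp =>
    val committed = Option(consumer.committed(tp)).map(_.offset).getOrElse(0L)
    endOffsets(tp).longValue - committed
  }.sum
  consumer.close()

  println(s"total lag = $totalLag")   // feed this into the infra scaling trigger
}
```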

Application auto scaling: In the above scenario, let's say that I add one more node to the Spark cluster once I see that messages are being held up in Kafka for too long. A new worker starts and registers with the master, and thus the Spark cluster has more horsepower available. There are two ways of making use of this additional horsepower. One strategy could be to repartition the Kafka topic by adding more partitions. Once I do that, the Spark cluster will pull more messages in parallel during the next batch, and thus the processing speed may go up. The other strategy could be not to repartition the Kafka topic, but to add more cores to the existing executors so that the message processing time goes down and thus more messages may be processed from an individual partition in the same time.
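For context, the stream is created roughly like this (a minimal sketch against the spark-streaming-kafka-0-10 integration; the topic, brokers, group id, batch interval and processing logic are placeholders). Out of the box, each batch RDD has one partition per Kafka partition:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("autoscaling-demo")
val ssc  = new StreamingContext(conf, Seconds(30))           // placeholder batch interval

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker1:9092",                    // placeholder brokers
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "streaming-job-group",
  "auto.offset.reset"  -> "latest"
)

// One RDD partition per Kafka topic partition, so the per-batch read
// parallelism is capped by the partition count of the topic.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))

stream.foreachRDD { rdd =>
  rdd.map(_.value).count()                                   // placeholder processing
}

ssc.start()
ssc.awaitTermination()
```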

Are the above strategies correct, or are there other ways of handling such scenarios?

scorpio

2 Answers


add more cores to existing executors so that the message processing time goes down and thus more messages may be processed from an individual partition in same time.

Spark doesn't work like that. Each partition is normally processed by a single thread. Adding more cores might give you a performance boost only if there are tasks queued up waiting for executors.

Might, because CPU is not the only resource that matters. Adding more cores won't help if the bottleneck is, for example, the network.
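To make the first point concrete: the number of tasks running at once is bounded by the total executor cores, while the direct stream creates only as many tasks per batch as there are Kafka partitions. Rough arithmetic with made-up numbers:

```scala
// Illustrative capacity arithmetic; the figures are assumptions, not from the question.
val executors        = 5
val coresPerExecutor = 4
val taskSlots        = executors * coresPerExecutor   // 20 tasks can run concurrently
val kafkaPartitions  = 10                              // tasks per batch in the read stage

// With 10 partitions, only 10 of the 20 slots are busy during the read stage.
// The extra cores help only if later stages (e.g. after a shuffle) create more
// tasks, or if tasks are otherwise queued waiting for free slots.
```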

One strategy could be to repartition the kafka topic by adding more partitions. Once I do that Spark cluster will pull more messages in parallel during next batch and thus the processing speed may go up.

This will help if the Spark cluster has enough resources to process the additional partitions. Otherwise they will just wait for their share of resources.

Also, adding partitions alone might not be a solution if you don't scale the Kafka cluster at the same time.
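If you do go down that route, partitions can also be added programmatically; here is a hedged sketch using the Kafka AdminClient (needs a Kafka 1.0+ client and broker; the broker address, topic and target count are placeholders). Keep in mind that adding partitions changes how keys are mapped to partitions:

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.admin.{AdminClient, NewPartitions}

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")            // placeholder broker
val admin = AdminClient.create(props)

// Grow "my-topic" to 20 partitions (the partition count can only be increased).
admin.createPartitions(
  Map("my-topic" -> NewPartitions.increaseTo(20)).asJava
).all().get()

admin.close()
```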

Finally, your comment:

Now in the code I could be doing a repartition of this RDD to speed up processing.

Unless processing is heavy, repartitioning will cost more than just processing the data.
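In code terms, the extra cost is the full shuffle that a repartition implies before the real work starts. A sketch, assuming a `stream` DStream like the one sketched in the question and a stand-in for the actual processing:

```scala
stream.foreachRDD { rdd =>
  // repartition() shuffles every record across the network before any
  // processing happens, so it only pays off when the per-record work is
  // heavy enough to outweigh the shuffle cost.
  rdd.repartition(40)            // illustrative target partition count
     .map(_.value.length)        // stand-in for the real (heavy) processing
     .count()
}
```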

So what is the answer?

  • Scaling only one component can sustain throughput if resources are unbalanced, i.e. when that component is the bottleneck.
  • If resources are balanced, you might have to scale all interacting components.
  • Before you do, make sure that you have correctly identified the bottleneck.
Alper t. Turker

Even if you scale up your infrastructure, the number of parallel consumers is limited by the number of partitions in your topic. So the right way is to increase the number of partitions as and when required. If you feel the need to scale up your infra, you can do so as well.

Selvaram G
  • I understand that the number of parallel consumers in Spark will be equal to the number of Kafka topic partitions. But having more horsepower with each executor may help in processing individual messages from a partition at a faster rate, thus reducing the overall message processing delay. Don't you agree? – scorpio Apr 17 '18 at 07:56
  • I think it comes down to horizontal scaling vs vertical scaling. You are suggesting vertical scaling by increasing the number of cores. Hypothetically there is an upper limit on the number of cores per instance, and I am also not sure how you will dynamically scale the consumers based on load. Whereas if you choose to increase the number of partitions and throw more instances at it, the scaling problem becomes easy. In short, yes, it does help, but the other way is simpler and has less downtime. – Selvaram G Apr 17 '18 at 08:06
  • I am not suggesting vertical scaling. Let's say that the Kafka topic has 10 partitions, and thus Spark creates an RDD with 10 partitions. Now in the code I could be doing a repartition of this RDD to speed up processing. Once I add more executors by adding more nodes to the cluster, Spark will have more executors, and more partitions of the repartitioned RDD may be processed in parallel (provided the RDD partition count > number of executors previously available). Thoughts? – scorpio Apr 17 '18 at 08:17