I am using Kafka with Spark Streaming (2.2.0). The load on the system is dynamic, and I am trying to understand how to handle auto scaling. There are two aspects of auto scaling:
- Auto scale the computing infra
- Auto scale the application components to take advantage of the auto scaled infra
Infra auto scaling: There can be various well-defined trigger points for scaling the infra. One possible trigger in my case would be the latency or delay in processing messages arriving at Kafka. So I can monitor the Kafka cluster, and if message processing falls behind by more than a certain threshold, I know that more computing power needs to be added.
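To make that trigger concrete, here is a rough sketch of how I imagine measuring the backlog as consumer lag (the broker address, topic name, group id and threshold are all made-up placeholders, it assumes a reasonably recent kafka-clients on the classpath, and it assumes the streaming job commits its offsets back to Kafka):

```scala
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import scala.collection.JavaConverters._

object KafkaLagCheck {
  def main(args: Array[String]): Unit = {
    // Placeholder broker address, topic name, and the consumer group id used by the Spark job
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("group.id", "spark-streaming-app")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    try {
      val partitions = consumer.partitionsFor("events").asScala
        .map(pi => new TopicPartition(pi.topic, pi.partition))

      // Latest offset per partition on the brokers
      val endOffsets = consumer.endOffsets(partitions.asJava).asScala

      // Lag = latest offset minus the offset the group has committed so far
      val totalLag = partitions.map { tp =>
        val committed = Option(consumer.committed(tp)).map(_.offset).getOrElse(0L)
        endOffsets(tp).longValue() - committed
      }.sum

      // The threshold is just an example; in practice it would be tied to the batch interval / SLA
      if (totalLag > 100000L)
        println(s"Total lag $totalLag exceeds threshold -> trigger infra scale-out")
      else
        println(s"Total lag $totalLag within limits")
    } finally {
      consumer.close()
    }
  }
}
```

The idea would be to run something like this periodically and, when the lag stays above the threshold for a while, call whatever API provisions a new node.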
Application auto scaling: In the above scenario, let's say that I add one more node to the Spark cluster once I see that messages are being held up in Kafka for too long. A new worker starts and registers with the master, so the Spark cluster has more horsepower available. There are two ways of making use of this additional horsepower. One strategy could be to repartition the Kafka topic by adding more partitions; once I do that, the Spark cluster will pull more messages in parallel during the next batch, and processing speed may go up. The other strategy could be to leave the Kafka topic alone but add more cores to the existing executors, so that per-message processing time goes down and more messages can be processed from an individual partition in the same time.
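For reference, here is a rough sketch of the direct-stream setup I have in mind (spark-streaming-kafka-0-10 integration; the topic, broker, group id and batch interval are placeholders). My understanding is that each batch RDD gets one Spark partition per Kafka partition, so the first strategy takes effect automatically, while the second strategy could also be approximated by repartitioning the stream across the available cores instead of relying only on faster per-task processing:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object StreamParallelismSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-scaling-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))   // assumed 10s batch interval

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",            // placeholder broker address
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "spark-streaming-app",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // Strategy 1: do nothing here; each batch RDD has one Spark partition per Kafka
    // partition, so adding Kafka partitions automatically adds parallel tasks.
    //
    // Strategy 2: keep the Kafka topic as-is and fan the records out over however
    // many cores the cluster currently offers, at the cost of a shuffle.
    val cores = ssc.sparkContext.defaultParallelism     // used here only as an example target
    val fannedOut = stream.map(record => record.value).repartition(cores)

    fannedOut.foreachRDD { rdd =>
      // Placeholder for the real per-batch processing
      println(s"Batch with ${rdd.getNumPartitions} partitions, ${rdd.count} records")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The repartition() introduces an extra shuffle, so I am not sure whether that is preferable to simply adding Kafka partitions.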
I am not sure whether the above strategies are correct, or whether there are better ways of handling such scenarios?