
Is it possible to limit the size of the batches returned by the Kafka consumer for Spark Streaming?

I am asking because the first batch I get has hundreds of millions of records, and it takes ages to process and checkpoint them.

Samy Dindane

3 Answers


I think your problem can be solved by Spark Streaming Backpressure.

Check spark.streaming.backpressure.enabled and spark.streaming.backpressure.initialRate.

By default, spark.streaming.backpressure.initialRate is not set and spark.streaming.backpressure.enabled is disabled, so I suppose Spark will take as much as it can.

From the Apache Spark Kafka configuration documentation:

spark.streaming.backpressure.enabled:

This enables the Spark Streaming to control the receiving rate based on the current batch scheduling delays and processing times so that the system receives only as fast as the system can process. Internally, this dynamically sets the maximum receiving rate of receivers. This rate is upper bounded by the values spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition if they are set (see below).

And since you want to control the first batch, or to be more specific, the number of messages in the first batch, I think you need spark.streaming.backpressure.initialRate.

spark.streaming.backpressure.initialRate:

This is the initial maximum receiving rate at which each receiver will receive data for the first batch when the backpressure mechanism is enabled.

This one is good when your Spark job (or rather, your Spark workers overall) is able to process, say, 10000 messages from Kafka, but the Kafka brokers hand your job 100000 messages.

You may also be interested in spark.streaming.kafka.maxRatePerPartition, as well as the research and suggestions for these properties on a real example by Jeroen van Wilgenburg on his blog.
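
As a rough illustration of wiring these properties together, here is a minimal Scala sketch; the rate values, batch interval, and application name are placeholder assumptions, not recommendations:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Placeholder rates: tune them to what your job can actually process.
val conf = new SparkConf()
  .setAppName("kafka-backpressure-example") // hypothetical app name
  // Let Spark adapt the ingestion rate to the observed processing speed.
  .set("spark.streaming.backpressure.enabled", "true")
  // Cap for the very first batch, before the backpressure estimator has any history.
  .set("spark.streaming.backpressure.initialRate", "10000")
  // Hard upper bound per Kafka partition per second (direct stream only).
  .set("spark.streaming.kafka.maxRatePerPartition", "2000")

val ssc = new StreamingContext(conf, Seconds(10)) // 10-second batch interval (assumed)
```

The same keys can also be passed as --conf flags to spark-submit instead of being set in code.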

VladoDemcak
  • That's what I was looking for, thank you. Unfortunately, neither spark.streaming.backpressure.initialRate, spark.streaming.backpressure.enabled, spark.streaming.receiver.maxRate nor spark.streaming.receiver.initialRate changes how many records I get (I tried many different combinations). The only configuration that works is "spark.streaming.kafka.maxRatePerPartition". That's better than nothing, but it'd be useful to have backpressure enabled for automatic scaling. Do you have any idea why backpressure isn't working? How can I debug this? – Samy Dindane Oct 12 '16 at 14:29
  • Maybe `spark.streaming.backpressure.initialRate` works, but as Jeroen van Wilgenburg noticed on his blog: "It's a good idea to set a maximum because the backpressure algorithm isn't instant (which would be impossible)... trouble with a job with Kafka input that could handle about 1000 events/sec when Kafka decided to give us 50.000 records/sec in the first few seconds." .. but I am confused because it didn't work. `spark.streaming.backpressure.enabled` should "Internally, dynamically set the maximum receiving rate of receivers". – VladoDemcak Oct 12 '16 at 18:27
  • On the latest streaming docs, it mentions that setting `spark.streaming.backpressure.enabled` takes care of the rates dynamically. "In Spark 1.5, we have introduced a feature called backpressure that eliminate the need to set this rate limit, as Spark Streaming automatically figures out the rate limits and dynamically adjusts them if the processing conditions change.", which could explain why the rates didn't work if the property was set to true. – MrChristine Oct 13 '16 at 01:31
  • I am using Spark 1.6.1 with the createStream API. I am also not able to take advantage of spark.streaming.backpressure.enabled=true; it is not working for me either. The only setting that worked for me is spark.streaming.receiver.maxRate. – nilesh1212 Nov 24 '16 at 10:39
  • Does this approach work for Spark Structured Streaming? – Praneeth Ramesh Oct 07 '19 at 21:08
  • @VladoDemcak Is it possible to limit the send speed of Kafka Producer for Spark Streaming? – jay Wong Nov 01 '19 at 03:22

Apart from the above answers: batch size is the product of three parameters.

  1. batchDuration: the time interval (in seconds) at which streaming data is divided into batches.
  2. spark.streaming.kafka.maxRatePerPartition: sets the maximum number of messages per partition per second. Combined with batchDuration, this controls the batch size. You want maxRatePerPartition to be set, and large (otherwise you are effectively throttling your job), and batchDuration to be very small.
  3. Number of partitions in the Kafka topic.

For a better understanding of how this product behaves when backpressure is enabled or disabled, set spark.streaming.kafka.maxRatePerPartition for createDirectStream; a rough calculation with example numbers is sketched below.
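
As a back-of-the-envelope illustration of that product (with made-up numbers), the upper bound on records per batch when maxRatePerPartition is the effective limit works out as follows:

```scala
// Hypothetical values, purely for illustration.
val maxRatePerPartition = 1000L // spark.streaming.kafka.maxRatePerPartition (records/sec/partition)
val numPartitions       = 12L   // number of partitions in the Kafka topic
val batchDurationSec    = 10L   // StreamingContext batch interval in seconds

// Upper bound on records per batch:
val maxBatchSize = maxRatePerPartition * numPartitions * batchDurationSec
// 1000 * 12 * 10 = 120,000 records per batch at most
```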

Vikki
  • This answer is more exact than the accepted answer. – samthebest Oct 24 '18 at 13:49
  • This really makes sense, thank you for the clarification. But my situation is a little different: in my job I am consuming from multiple topics, so the problem is NumberOfTopics X NumberOfPartitions X MaxRatePerPartition X BatchDuration, and I wanted to set a max for this. Using Spark 2.4 with a direct Kafka stream. – Hrishikesh Mishra Feb 24 '20 at 15:56

Limiting the max batch size will greatly help to control the processing time; however, it increases the processing latency of messages.

By setting the properties below, we can control the batch size:

spark.streaming.receiver.maxRate=
spark.streaming.kafka.maxRatePerPartition=

You could even set the batch size dynamically based on processing time, by enabling backpressure:

spark.streaming.backpressure.enabled=true
spark.streaming.backpressure.initialRate=
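
For context, here is a small sketch of where each of these caps applies, with placeholder values: spark.streaming.receiver.maxRate limits receiver-based streams (KafkaUtils.createStream), while spark.streaming.kafka.maxRatePerPartition limits the direct approach (KafkaUtils.createDirectStream):

```scala
import org.apache.spark.SparkConf

// Placeholder values; tune to your processing capacity.
val conf = new SparkConf()
  // Receiver-based streams (KafkaUtils.createStream): records/sec per receiver.
  .set("spark.streaming.receiver.maxRate", "10000")
  // Direct streams (KafkaUtils.createDirectStream): records/sec per Kafka partition.
  .set("spark.streaming.kafka.maxRatePerPartition", "2000")
  // Let Spark adjust the rate dynamically, with a cap on the first batch.
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.backpressure.initialRate", "10000")
```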