I'm running a Spark Streaming application that reads data from Kafka. I have enabled checkpointing so the job can recover in case of failure.
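For context, here's a minimal sketch of the kind of setup I mean (the checkpoint directory, broker address, topic, and group id below are placeholders, not my real values); I'm using the standard `StreamingContext.getOrCreate` recovery pattern with the Kafka direct stream:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object CheckpointedKafkaStream {
  val checkpointDir = "hdfs:///tmp/spark-checkpoint" // placeholder path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("kafka-stream")
    val ssc = new StreamingContext(conf, Seconds(60)) // 1-minute micro-batches
    ssc.checkpoint(checkpointDir) // enable checkpointing for recovery

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092", // placeholder
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my-consumer-group", // placeholder
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // Placeholder processing; the real job does more work per batch.
    stream.foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart, getOrCreate rebuilds the context from the checkpoint;
    // this is the point where the whole backlog gets replayed.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```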
The problem is that when the application fails and restarts, it tries to process all the data accumulated since the point of failure in a single micro-batch. For example, if a micro-batch usually receives 10,000 events from Kafka and the application restarts 10 minutes after failing, it has to process one micro-batch of 100,000 events.
Now, if I want recovery from the checkpoint to succeed, I have to assign much more memory than I normally would.
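Concretely (illustrative figures only, not my exact numbers), the over-provisioning looks something like this:

```scala
import org.apache.spark.SparkConf

// Steady state: ~2g per executor is plenty for 10,000-event batches.
// To survive the one-shot recovery batch, I have to run with something like:
val conf = new SparkConf()
  .setAppName("kafka-stream")
  .set("spark.executor.memory", "8g") // several times the steady-state value
```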
Is it normal that, on restart, Spark Streaming tries to process all the backlogged events from the checkpoint in a single micro-batch, or am I doing something wrong?
Many thanks.