
We are building a fault-tolerant system using Spark Streaming and Kafka, and we are testing Spark Streaming checkpointing so that we have the option of restarting the Spark job if it crashes for any reason. Here is what our Spark process looks like (a rough sketch of the driver setup follows the bullets):

  • Spark Streaming runs every 5 seconds (slide interval) to read data from Kafka
  • Kafka is receiving about 80 messages per second
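Roughly, the driver looks like this (a simplified sketch using the direct Kafka stream API; the checkpoint directory, broker address, and topic name are placeholders, not our real values):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CheckpointedJob {
      // Placeholder checkpoint location for illustration.
      val checkpointDir = "hdfs:///checkpoints/streaming-job"

      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("checkpointed-kafka-job")
        val ssc  = new StreamingContext(conf, Seconds(5))   // 5-second slide interval
        ssc.checkpoint(checkpointDir)

        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("my-topic"))

        stream.foreachRDD { rdd =>
          // ... process the micro-batch here ...
          println(s"processed ${rdd.count()} records")
        }
        ssc
      }

      def main(args: Array[String]): Unit = {
        // On restart, getOrCreate rebuilds the context from the checkpoint
        // and re-schedules the batches missed while the job was down.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }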

What we want to achieve is a setup where we can bring down the Spark Streaming job (to mimic a failure) and then restart it, while still ensuring that we process every message from Kafka. This seems to work fine, but here is what I see and don't know what to make of:

  • After we restart the Spark job, a batch is created for all the lost time. So, for example, if we bring the job down and restart it after a minute, 12 batches are created (one for every 5 seconds). Please see the image below.
  • None of these batches processes any data. As you can see in the image below, these batches have an input size of 0. We have to wait for all of them to complete before the batches with data start getting processed. This gets worse if we restart the job after a gap of hours, as hundreds of batches are created that don't process anything but still have to complete.

Any input on this would be appreciated:

  • Is this expected? Why are batches being created when they don't process any data (the Kafka topic is receiving messages continuously)?
  • There is also a second thing which is confusing. After we bring the Spark process down for a minute and restart it, there are 4800 (80 * 60) messages in the Kafka topic waiting to be processed. These messages do appear to be processed, but I don't see any batch on the UI with an input size of 4800.

[Screenshot: Spark Streaming UI showing the recovered batches with input size 0]

Shay

2 Answers


Is this expected? Why are batches being created when they don't process any data

That's what Spark's fault-tolerance semantics guarantee: even if your service fails, it can pick up from the last processed point in time and continue processing. Spark reads the checkpointed data and initiates the recovery process until it reaches the current point in time. Spark isn't aware of zero-event batches, and thus does nothing to optimize them away.

It looks like that these messages are being processed but I don't see any batch on the UI that has an input size of 4800

This may happen for various reasons. A common one is that you have Spark's backpressure flag (spark.streaming.backpressure.enabled) set to true. Spark sees that you have a significant processing delay, so it reduces the number of messages read per batch in order to allow the streaming job to catch up.
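If that is what is happening, the relevant settings look roughly like this (a sketch; whether you want backpressure enabled, and what rate cap to use, depends on your workload):

    val conf = new org.apache.spark.SparkConf()
      .setAppName("checkpointed-kafka-job")
      // Let Spark dynamically lower the per-batch ingestion rate when batches fall behind.
      .set("spark.streaming.backpressure.enabled", "true")
      // Optional hard cap on records read per Kafka partition per second (direct stream API).
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")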

Yuval Itzchakov

Is this expected? Why are batches being created when they don't process any data

In fact, with Spark Streaming and Kafka, when recovering from a checkpoint, Spark first generates the recovery jobs. All of the backlogged data is processed in one or more of those batches (how many depends on configuration), but in the web UI you only see the recovered batches reported as executing with 0 events.

There is also a second thing which is confusing ...

Yes, from the web UI this is confusing. Try counting the number of events in each batch yourself and printing it:

dstream.foreachRDD(rdd => println(rdd.count()))

You'll find that Spark really does process the data in the batches created from the checkpoint, even though the web UI shows 0 events for them.

If your application finds it difficult to process all of the backlogged events in one batch after recovering from a failure, how do you control the number of batches Spark creates?

Look at spark.streaming.kafka.maxRatePerPartition: the maximum rate (number of records per second) at which data will be read from each Kafka partition when using the new Kafka direct stream API.

maxRatePerPartition * numberOfKafkaPartitions * batchDuration * N = numberOfEventsToProcess

where N is the number of batches Spark needs to process after recovering from the checkpoint.
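As a rough worked example using the question's numbers (the partition count here is made up, since the question doesn't state it):

    // Assume maxRatePerPartition = 100 records/sec, 4 Kafka partitions (assumed for
    // illustration), and the question's 5-second batch duration.
    val maxRatePerPartition = 100
    val kafkaPartitions     = 4
    val batchDurationSec    = 5
    val recordsPerBatch     = maxRatePerPartition * kafkaPartitions * batchDurationSec  // 2000
    // A one-minute outage leaves 4800 messages backlogged (80 msg/s * 60 s),
    // so Spark would need N = ceil(4800.0 / recordsPerBatch) = 3 batches to catch up.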
Yulin GUO