We are building a fault-tolerant system using Spark Streaming and Kafka, and we are testing Spark Streaming checkpointing so that we have the option of restarting the Spark job if it crashes for any reason. Here's what our Spark process looks like (a rough sketch of the driver follows the list below):
- Spark Streaming runs every 5 seconds (slide interval) to read data from Kafka
- Kafka is receiving about 80 messages per second
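This is only a minimal sketch of the kind of setup we have, not our exact code: it assumes the direct Kafka stream API, and the broker address, topic name, and checkpoint path are placeholders. The important part is that the context is recovered from the checkpoint directory via `StreamingContext.getOrCreate` on restart.

```scala
// Minimal sketch; broker, topic, and checkpoint path are placeholders.
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KafkaStreamingJob {
  val checkpointDir = "hdfs:///checkpoints/kafka-streaming"    // placeholder path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("kafka-streaming-job")
    val ssc = new StreamingContext(conf, Seconds(5))           // 5-second batch interval
    ssc.checkpoint(checkpointDir)

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // placeholder broker
    val topics = Set("our-topic")                                    // placeholder topic

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.foreachRDD { rdd =>
      // process the records of this batch here
      println(s"records in batch: ${rdd.count()}")
    }
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart, the context (and any pending batch metadata) is recovered from the checkpoint
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```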
What we want to achieve is a setup where we can bring down the Spark Streaming job (to mimic a failure), restart it, and still ensure that we process every message from Kafka. This seems to work, but here is what I see and don't know what to make of:
- After we restart the Spark job, a batch is created for every batch interval that was missed while the job was down. For example, if we bring the job down and restart it after a minute, 12 batches are created (one for every 5 seconds). Please see the image below.
- None of these batches process any data. As you can see in the image below, these batches have an input size of 0. We have to wait for all of them to complete before the batches with data start getting processed. This gets worse if we restart the job after a gap of hours, since hundreds of batches are created that don't process anything but still have to complete.
Any input on this would be appreciated:
- Is this expected? Why are batches being created when they don't process any data (the Kafka topic is receiving messages continuously)?
- There is a second thing that is also confusing. After we bring the Spark process down for a minute and restart it, there are 4800 (80 * 60) messages in the Kafka topic waiting to be processed. It looks like these messages are being processed, but I don't see any batch on the UI that has an input size of 4800.