
I have a Spark 1.4 Streaming application which reads data from Kafka, uses a stateful transformation, and has a batch interval of 15 seconds.

In order to use stateful transformations, as well as to recover from driver failures, I need to set checkpointing on the streaming context.

Also, the Spark 1.4 documentation recommends a DStream checkpoint interval of 5-10 times the batch interval.

So my questions are:

What happens if I only set checkpointing on the Spark streaming context? I guess the DStreams will be checkpointed every batch interval?

What if I set checkpointing on the streaming context and, at the moment I read data from Kafka, I also set:

DStream.checkpoint(Seconds(90))
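
For concreteness, this is roughly what I mean (a minimal sketch; the Kafka parameters, topic, and state-update function are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("stateful-kafka-stream")
val ssc = new StreamingContext(conf, Seconds(15))         // 15-second batch interval
ssc.checkpoint("hdfs:///tmp/checkpoints")                 // checkpoint directory on the context

val kafkaParams = Map("metadata.broker.list" -> "broker:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my-topic"))
stream.checkpoint(Seconds(90))                            // explicit DStream checkpoint interval

val counts = stream.map { case (_, line) => (line, 1L) }
  .updateStateByKey[Long]((values: Seq[Long], state: Option[Long]) =>
    Some(state.getOrElse(0L) + values.sum))               // stateful transformation
counts.print()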

What will the intervals then be for metadata checkpointing and for data checkpointing (meaning the DStreams)?

Thank you.

Srdjan Nikitovic

1 Answer


I guess DStreams will be checkpointed every batch interval?

No, Spark will checkpoint your data every batch interval multiplied by a constant. This means that if your batch interval is 15 seconds, data will be checkpointed at some multiple of 15 seconds. In mapWithState, for example, which is a stateful stream, you can see that the batch interval is multiplied by 10:

private[streaming] object InternalMapWithStateDStream {
  private val DEFAULT_CHECKPOINT_DURATION_MULTIPLIER = 10
}
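
Applied to the question's 15-second batch interval, that mapWithState default works out to 15 s * 10 = 150 s. If you want a different interval you can override it by calling checkpoint explicitly on the stateful stream. A sketch only, where stateStream is a placeholder for whatever stateful DStream you build:

// Default for mapWithState: batch interval * 10 = 15s * 10 = 150 seconds.
// Hypothetical explicit override:
stateStream.checkpoint(Seconds(90))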

What will be the intervals for metadata checkpointing and what for data checkpointing (meaning DStreams)?

If you set the checkpoint duration to 90 seconds on the DStream, then that'll be your checkpoint duration, meaning the data will get checkpointed every 90 seconds. You cannot set the checkpoint duration directly on the StreamingContext; all you can do there is pass the checkpoint directory. Its checkpoint method only takes a String:

/**
 * Set the context to periodically checkpoint the DStream operations for driver
 * fault-tolerance.
 * @param directory HDFS-compatible directory where the checkpoint
 *        data will be reliably stored.
 *        Note that this must be a fault-tolerant file system like HDFS.
 */
def checkpoint(directory: String)

Edit

For updateStateByKey, it looks like the checkpoint interval is set to slideDuration * ceil(Seconds(10) / slideDuration), i.e. the smallest multiple of slideDuration that is at least 10 seconds:

// Set the checkpoint interval to be slideDuration or 10 seconds,
// which ever is larger
if (mustCheckpoint && checkpointDuration == null) {
  checkpointDuration = slideDuration * math.ceil(Seconds(10) / slideDuration).toInt
  logInfo(s"Checkpoint interval automatically set to $checkpointDuration")
}
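
As a quick worked example with the question's 15-second batch interval (so slideDuration = 15 s), the same expression evaluates like this (a sketch, using Spark's Duration arithmetic):

import org.apache.spark.streaming.Seconds

val slideDuration = Seconds(15)                        // the question's batch interval
val ratio = Seconds(10) / slideDuration                // 10.0 / 15.0 = 0.67
val interval = slideDuration * math.ceil(ratio).toInt  // ceil(0.67) = 1, so 15 seconds

So with a 15-second batch, updateStateByKey checkpoints every batch unless you set a larger interval yourself with checkpoint(...).
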
Yuval Itzchakov
  • Thank you for your answer. Do you know what the checkpoint interval is for the updateStateByKey transformation? – Srdjan Nikitovic Jun 09 '16 at 13:35
  • Ok, so for my 15-second batch interval the checkpointing interval is: 15 * math.ceil(10/15) = 15 seconds. Do you think that is optimal, or should I consider manually changing the interval (setting it to be larger)? – Srdjan Nikitovic Jun 09 '16 at 14:15
  • @SrdjanNikitovic That really depends on your job, how much incoming data you have, how many worker nodes are processing, etc. – Yuval Itzchakov Jun 09 '16 at 14:21
  • Thank you a lot for the answers. – Srdjan Nikitovic Jun 09 '16 at 14:27
  • Sorry for being naive, but if the checkpoint duration is a multiple of the batch duration and not every batch, then what happens to the batches that arrive in between checkpoints? – Monu Oct 25 '20 at 03:00
  • @Monu That means that if you had, for example, a batch time of 30 seconds and you only checkpoint every 5 minutes, and your application crashed, you'd be reprocessing some of that data over again. – Yuval Itzchakov Oct 25 '20 at 09:51
  • I understand the fault-tolerance aspect of it and I don't want to go into specifics. I am more interested in business-specific use cases, for example having an average word count for the last 2 hours. As in the 90-second checkpoint example above, my understanding is that to be able to checkpoint 90 seconds of data, Spark would have to keep all incoming batches from those 90 seconds in memory so that it can checkpoint all the RDDs. If that's true, what happens if one does not set a checkpoint directory at all? Will Spark keep all the RDDs in memory, or does it just end up with an infinite lineage? – Monu Oct 25 '20 at 15:22