
I have a PySpark streaming job that streams a directory from S3 (using textFileStream). Each line of input is parsed and written out as Parquet on HDFS.
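For reference, here is a stripped-down sketch of the shape of the job; the bucket, paths, batch interval, and parse_line() below are placeholders rather than my real code:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="s3-to-parquet")
ssc = StreamingContext(sc, batchDuration=60)
ssc.checkpoint("hdfs:///checkpoints/s3-to-parquet")

def parse_line(line):
    # Stand-in for my real parsing logic; runs in the executors.
    ts, value = line.split(",", 1)
    return Row(ts=ts, value=value)

lines = ssc.textFileStream("s3n://my-bucket/incoming/")
parsed = lines.map(parse_line)

def write_batch(batch_time, rdd):
    # Output action; runs in the driver. batch_time is a datetime in
    # PySpark's two-argument foreachRDD form.
    if rdd.isEmpty():
        return
    df = SQLContext(rdd.context).createDataFrame(rdd)
    out_dir = "hdfs:///data/output/batch=%s" % batch_time.strftime("%Y%m%d%H%M%S")
    # One directory per batch, overwritten on recompute, so a replayed
    # batch replaces its output instead of duplicating it.
    df.write.mode("overwrite").parquet(out_dir)

parsed.foreachRDD(write_batch)

ssc.start()
ssc.awaitTermination()
```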

This works great under normal circumstances. However, what options do I have for recovering lost batches of data when one of the following error conditions occurs?

  • An exception occurs in the driver inside a call to foreachRDD, where my output operations run (possibly an HdfsError, or a Spark SQL exception raised during output operations such as partitionBy or dataframe.write.parquet()). As far as I know, this is classified as an "action" in Spark (vs. a "transformation").
  • An exception occurs in an executor, perhaps because an exception occurred in a map() lambda while parsing a line.

The system I am building must be a system of record. All of my output semantics follow the Spark Streaming documentation's guidance for exactly-once output semantics (if a batch/RDD has to be recomputed, its output is overwritten, not duplicated).

How do I handle failures in my output action (inside foreachRDD)? AFAICT, exceptions that occur inside foreachRDD do not cause the streaming job to stop. In fact, I've tried to figure out how to make unhandled exceptions inside foreachRDD stop the job, and have been unable to do so.

Say an unhandled exception occurs in the driver. If I need to make a code change to resolve the exception, my understanding is that I would need to delete the checkpoint before resuming. In this scenario, is there a way to start the streaming job in the past from the timestamp at which the streaming job stopped?

octagonC

1 Answer


Generally speaking, every exception thrown inside a function passed to a mapPartitions-like operation (map, filter, flatMap) should be recoverable. There is simply no good reason for a whole action / transformation to fail on a single malformed input. The exact strategy will depend on your requirements (ignore, log, keep the bad records for further processing); a minimal sketch of the last variant is shown below. You can find some ideas in What is the equivalent to scala.util.Try in pyspark?
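For example, a minimal sketch of the keep-the-bad-records variant, assuming a raw lines DStream and a parse_line() function like the ones in the question (the tags and names here are just illustrative):

```python
def parse_or_keep(line):
    try:
        return ("ok", parse_line(line))
    except Exception as e:
        # Keep the raw line and the error instead of failing the whole batch.
        return ("bad", (line, repr(e)))

tagged = lines.map(parse_or_keep).cache()
parsed = tagged.filter(lambda kv: kv[0] == "ok").map(lambda kv: kv[1])
rejected = tagged.filter(lambda kv: kv[0] == "bad").map(lambda kv: kv[1])

# parsed feeds the normal parquet output; rejected can be logged or written
# to a quarantine location for later inspection or reprocessing.
```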

Handling an operation-wide failure is definitely harder. Since, in general, it may not be recoverable, and waiting it out may not be an option given the incoming traffic, I would optimistically retry on failure and, if that doesn't succeed, push the raw data to an external backup system (S3, for example).
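A rough sketch of that idea inside foreachRDD; the retry count, backup bucket, and write_output() helper are placeholders for your own code:

```python
import time

def write_with_fallback(batch_time, rdd):
    # write_output() stands in for your normal parquet output logic.
    if rdd.isEmpty():
        return
    for attempt in range(3):          # optimistic retries
        try:
            write_output(batch_time, rdd)
            return
        except Exception:
            time.sleep(2 ** attempt)  # simple backoff before retrying
    # Still failing: push the batch to a durable backup location (S3 here)
    # so it can be reprocessed later instead of being lost.
    backup = "s3n://my-backup-bucket/failed-batches/%s" % \
        batch_time.strftime("%Y%m%d%H%M%S")
    rdd.saveAsTextFile(backup)

parsed.foreachRDD(write_with_fallback)
```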

zero323
  • I think your stated approach will work well for failures in a map/filter/flatMap/etc. However, say that an exception occurs in my driver due to some bug. The checkpoint allows me to resume the driver where it left off, but AFAIK it cannot support code changes prior to resume. In this scenario, is it possible for me to make a code change (affecting the driver) and then resume the streaming job from the timestamp at which it failed (in the past)? – octagonC Nov 13 '15 at 19:19
  • @octagonC The reason it's not supported to resume streaming from a checkpoint with "code changes" is that checkpoints store serialized Java/Scala objects. This may very well not apply to PySpark, which pickles everything. So I think resuming from a checkpoint with _only_ python code changed might very well work. I suggest you give it a try. – Blake Miller Mar 19 '16 at 04:11