I have a PySpark streaming job that streams a directory from S3 (using `textFileStream`). Each line of input is parsed and written out in Parquet format on HDFS.
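Roughly, the ingestion side of the job looks like the sketch below (the S3 path, batch interval, checkpoint directory, and `parse_line` are placeholders, not my real values):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def parse_line(line):
    # Stand-in parser; the real parsing logic is more involved.
    ts, level, msg = line.split("\t", 2)
    return (ts, level, msg)

sc = SparkContext(appName="s3-to-parquet")
ssc = StreamingContext(sc, batchDuration=60)             # illustrative batch interval
ssc.checkpoint("hdfs:///checkpoints/s3-to-parquet")      # illustrative checkpoint dir

lines = ssc.textFileStream("s3a://my-bucket/incoming/")  # illustrative bucket/prefix
parsed = lines.map(parse_line)                           # map() runs in executor tasks
```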
This works great under normal circumstances. However, what options do I have for recovering lost batches of data when one of the following error conditions occurs?
- An exception occurs in the driver inside a call to `foreachRDD`, where the output operations occur (possibly an `HdfsError`, or a Spark SQL exception during output operations such as `partitionBy` or `dataframe.write.parquet()`). As far as I know, this is classified as an "action" in Spark (vs. a "transformation"). See the sketch after this list.
- An exception occurs in an executor, perhaps because an exception was raised in a `map()` lambda while parsing a line.
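The output side, where the first failure mode would surface, is roughly the following (continuing the sketch above; the schema, partition column, and HDFS path are again placeholders, and I'm writing it against Spark 2.x's `SparkSession`; on 1.x this would be a `SQLContext`):

```python
from pyspark.sql import SparkSession

def write_batch(batch_time, rdd):
    # Runs on the driver; .parquet() triggers the actual write, so this is
    # where an HdfsError or a Spark SQL exception would be raised.
    if rdd.isEmpty():
        return
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(rdd, ["ts", "level", "msg"])   # illustrative schema
    df.write.partitionBy("level").parquet("hdfs:///data/events")

parsed.foreachRDD(write_batch)
```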
The system I am building must be a system of record. All of my output semantics conform to the Spark Streaming documentation's exactly-once output semantics: if a batch/RDD has to be recomputed, the output data is overwritten, not duplicated.
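For concreteness, that overwrite-instead-of-duplicate behaviour can be had by keying the output location on the batch time and writing in overwrite mode. The helper below is hypothetical and only meant to illustrate the semantics, not my exact layout (if I read the API correctly, the two-argument `foreachRDD` callback receives the batch time as a `datetime`):

```python
def batch_output_path(batch_time, root="hdfs:///data/events"):
    # One directory per batch interval: recomputing a batch rewrites the same
    # directory rather than appending a second copy of the data.
    return "{}/batch={}".format(root, batch_time.strftime("%Y%m%d%H%M%S"))

def write_batch_idempotently(batch_time, rdd):
    if rdd.isEmpty():
        return
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(rdd, ["ts", "level", "msg"])
    df.write.mode("overwrite").parquet(batch_output_path(batch_time))
```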
How do I handle failures in my output action (inside `foreachRDD`)? AFAICT, exceptions that occur inside `foreachRDD` do not cause the streaming job to stop. In fact, I have tried to find a way to make unhandled exceptions inside `foreachRDD` stop the job, and have been unable to do so.
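Here is a minimal sketch of what I mean, with a deliberately raised exception standing in for the real `HdfsError` / Spark SQL failure; as far as I can tell, the job keeps scheduling subsequent batches instead of stopping:

```python
def failing_output(rdd):
    # Deliberate failure in the output action, on the driver.
    if not rdd.isEmpty():
        raise ValueError("simulated failure during output")

parsed.foreachRDD(failing_output)

ssc.start()
ssc.awaitTermination()
```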
Say an unhandled exception occurs in the driver. If I need to make a code change to resolve the exception, my understanding is that I would need to delete the checkpoint before resuming. In that scenario, is there a way to start the streaming job at a point in the past, i.e., from the timestamp at which the job stopped?