I'm having an issue with checkpointing in production: Spark can't find a file in the _spark_metadata folder.

18/05/04 16:59:55 INFO FileStreamSinkLog: Set the compact interval to 10 [defaultCompactInterval: 10]
18/05/04 16:59:55 INFO DelegatingS3FileSystem: Getting file status for 's3u://data-bucket-prod/data/internal/_spark_metadata/19.compact'
18/05/04 16:59:55 ERROR FileFormatWriter: Aborting job null.
java.lang.IllegalStateException: s3u://data-bucket-prod/data/internal/_spark_metadata/19.compact doesn't exist when compacting batch 29 (compactInterval: 10)

A similar question was already asked, but there is no solution so far.

In the checkpoint folder I see that batch 29 is not committed yet. Can I remove something from the checkpoint's sources, state and/or offsets directories to prevent Spark from failing because of the missing _spark_metadata/19.compact file?
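To double-check how far the query got, the batch numbers in the checkpoint's offsets/ and commits/ directories can be compared: a batch only counts as committed once it appears in commits/ as well as offsets/. A minimal sketch in Scala (e.g. for spark-shell), assuming a hypothetical checkpoint location since the real one isn't shown here:

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
// Hypothetical checkpoint location -- replace with the query's real checkpointLocation.
val checkpointDir = "s3u://data-bucket-prod/checkpoints/internal"

// Returns the highest numeric batch file name in a metadata log directory.
def maxBatchId(dir: String): Option[Long] = {
  val path = new Path(dir)
  val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
  fs.listStatus(path)
    .map(_.getPath.getName)
    .filter(name => name.nonEmpty && name.forall(_.isDigit))
    .map(_.toLong)
    .reduceOption(_ max _)
}

// A gap here (e.g. offsets at 29, commits at 28) matches "batch 29 is not committed yet".
val lastOffsetBatch = maxBatchId(checkpointDir + "/offsets")
val lastCommittedBatch = maxBatchId(checkpointDir + "/commits")
println(s"last batch in offsets/: $lastOffsetBatch, last batch in commits/: $lastCommittedBatch")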

asked by Yuriy Bondaruk

1 Answer


The problem is that you are storing your checkpoints in S3, and checkpointing in S3 isn't 100% reliable. Read this article for the exact reason why.

Solution 1: Use HDFS to store checkpoints

Solution 2: Use EFS if you want to stay on Amazon Web Services. The article above provides detailed steps for setting up EFS.

Solution 3: Use NFS
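A minimal sketch of Solution 1 in Scala, matching the "HDFS as a sink and checkpoint location, copy to S3 later" approach discussed in the comments below. All paths, the Kafka source and its options are placeholders rather than anything taken from the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("hdfs-checkpointed-sink").getOrCreate()

// Hypothetical streaming source -- substitute your own.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

val query = events
  .selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("parquet")
  // Both the output path (which holds _spark_metadata) and the checkpoint live on HDFS,
  // so neither metadata log depends on S3 listing consistency.
  .option("path", "hdfs://namenode:8020/data/internal")
  .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/internal")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()

query.awaitTermination()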

answered by Harshit Sharma
  • Thanks for the link to the blog post! However, we decided to move to AWS Glue and use its bookmarking mechanism instead – Yuriy Bondaruk Jun 14 '18 at 11:51
  • I can't mark it as the right answer since the question is about fixing the existing checkpointing, while the answer is more about the cause of the problem and alternative solutions. Anyway, it's useful – Yuriy Bondaruk Jun 14 '18 at 21:20
  • Fixing existing checkpointing structure requires you to change the behavior of AWS S3. Currently Spark functions on read-after-write semantics. So, "Spark first writes all data to a temporary directory and only upon completion attempts to list the directory written to, making sure the folder exists, and only then it renames the checkpoint directory to its real name. Listing a directory after a PUT operation in S3 is eventually consistent per S3 documentation and would be the cause of sporadic failures which caused the checkpointing task to fail entirely." – Harshit Sharma Jun 15 '18 at 17:27
  • The inconsistency led to the problem with checkpointing, and my application was always failing after startup. There were two options: 1. fix the checkpointing files to exclude the latest batch that caused the failure, or 2. find out which files were already processed but still exist in the source folder, remove them, and then remove the entire checkpointing folder. After some experiments with removing the last batch info from the checkpointing files I decided to go with the second option. – Yuriy Bondaruk Jun 15 '18 at 17:59
  • There is no fix to the existing checkpointing. Please follow https://issues.apache.org/jira/browse/SPARK-18512 for latest updates. Until then people should use HDFS as a sink & a place to store checkpoints and then eventually copy the data to S3 from HDFS – Harshit Sharma Oct 10 '18 at 18:25
  • Databricks is able to reliably checkpoint to S3, https://databricks.com/blog/2017/05/31/transactional-writes-cloud-storage.html if you want to avoid using *FS services – chris fish Sep 24 '19 at 20:18
  • Has there been any change in this regard? Note Databricks is fine. – thebluephantom Jul 18 '20 at 14:47
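Following up on Yuriy Bondaruk's second option above, here is a rough sketch of listing the input files the checkpoint has already recorded as processed, so they can be reviewed and removed from the source folder before the checkpoint directory is deleted and the query restarted from scratch. The checkpoint path is hypothetical, and the assumption that the sources/0 log entries are JSON lines containing a "path" field should be verified against your Spark version before deleting anything:

import scala.io.Source
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
// Hypothetical path to the file-source metadata log inside the checkpoint directory.
val sourcesLog = "s3u://data-bucket-prod/checkpoints/internal/sources/0"

val logPath = new Path(sourcesLog)
val fs = logPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
// Pulls the value of every "path" field out of the (assumed) JSON-lines log entries.
val pathField = "\"path\"\\s*:\\s*\"([^\"]+)\"".r

val processedFiles = fs.listStatus(logPath).flatMap { status =>
  val in = fs.open(status.getPath)
  try {
    Source.fromInputStream(in)
      .getLines()
      .flatMap(line => pathField.findFirstMatchIn(line).map(_.group(1)))
      .toList
  } finally {
    in.close()
  }
}.toSet

// Review this list before removing anything from the source folder.
processedFiles.toSeq.sorted.foreach(println)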