Should you not use S3 as the Checkpoint location for a Spark Structured Streaming job?

Question

I am exploring my options for checkpointing with Spark Structured Streaming and have read that the "eventual consistency" of S3 is not ideal for checkpointing. I am trying to determine whether this is accurate. I doubt that a Spark Structured Streaming job would write to the checkpoint location and subsequently read from it within the job to determine where to continue from. Wouldn't the current checkpoint also be held in memory within the context of the job (which would mean that reading from S3 is not required to determine the current checkpoint)?

I am able to specify a location on S3 for checkpointing, but I am trying to determine whether this is against best practices. Can someone please clarify whether using S3 as the checkpoint location is suboptimal and, if so, why?

Does this answer your question? [Apache Spark (Structured Streaming) : S3 Checkpoint support](https://stackoverflow.com/questions/42006664/apache-spark-structured-streaming-s3-checkpoint-support) — thebluephantom, Jul 14 '20 at 16:58
1. Can you clarify how I am ignoring sound advice from experts? 2. The post you have linked above is already referenced in my question and is part of what I am trying to validate. S3 was not a valid option for checkpoint (as per the post from 2017) but it now is. I am trying to verify whether prior concerns are still valid or if they have been addressed (which is why it is possible to use S3 for checkpointing). — Brandon, Jul 14 '20 at 18:53
Well, where I worked in the past they had all the issues Steve mentioned, and in fact they decided to dispense with it. S3 is, and has been for a while, all the rage, were it not that you need managed services if you want to query S3 interactively. You do not need to agree; that is still possible today. This is how I see it. — thebluephantom, Jul 14 '20 at 19:01
If others answer and prove otherwise then I will be enlightened. So, let's hope someone else posts. — thebluephantom, Jul 14 '20 at 19:04
I am surprised there are no responses. Have you made any progress in refuting the prior viewpoint here? If so, please post an answer. — thebluephantom, Jul 17 '20 at 19:44
I will answer tomorrow, having looked at an AWS architecture I did and spoken with an ex-colleague. — thebluephantom, Jul 18 '20 at 20:40

thebluephantom · Answer 1 · 2020-07-20T09:27:03.273

I did a study on analytics cloud architecture 18 months ago (AWS EC2, AWS EMR, notebooks vs. classical). I went back to those notes and googled around to look for changes since then.

Your initial assumptions prevail, but with a nuance. Some pointers:

- If you use S3 on its own as the checkpoint location, the AWS contact pointed out that there could be performance and reliability issues.
- Databricks state that DBFS can be used as a checkpoint location, which amounts to an S3 backend. They have engineered around the problem as part of a managed service / environment.
- Qubole (https://www.qubole.com/blog/structured-streaming-with-direct-write-checkpointing/) offer a service for S3 checkpointing. Taken together with the Databricks approach, this tells me that using S3 as the checkpoint location without due consideration is still an issue and thus not recommended.
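
To make the reliability concern a bit more concrete, here is a toy model in plain Python. It is not Spark and not the S3 API: `ToyObjectStore`, `settle` and `latest_batch_id` are hypothetical names made up for this sketch, and the `checkpoint/offsets/<batchId>` key layout is only my understanding of how the checkpoint directory is organised. The only thing it models is the eventual consistency the question refers to: a key that has just been written may be missing from a listing for a while. A running job can keep its position in memory, but a restarted job has to find the latest batch by looking at the checkpoint location.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class ToyObjectStore:
    """Toy object store whose listing may lag behind writes (assumption)."""
    objects: Dict[str, bytes] = field(default_factory=dict)
    visible_in_listing: Set[str] = field(default_factory=set)

    def put(self, key: str, data: bytes) -> None:
        # The object is stored, but it does not show up in listings yet.
        self.objects[key] = data

    def settle(self) -> None:
        # Simulate the store catching up: every object becomes listable.
        self.visible_in_listing = set(self.objects)

    def list_keys(self, prefix: str) -> list:
        return sorted(k for k in self.visible_in_listing if k.startswith(prefix))


def latest_batch_id(store: ToyObjectStore,
                    prefix: str = "checkpoint/offsets/") -> Optional[int]:
    """Find the newest batch the way a restarted job would: by listing."""
    ids = [int(key[len(prefix):]) for key in store.list_keys(prefix)]
    return max(ids) if ids else None


store = ToyObjectStore()
for batch in range(3):
    store.put(f"checkpoint/offsets/{batch}", b"offsets")
store.settle()

# Batch 3 is written just before the job dies.
store.put("checkpoint/offsets/3", b"offsets")

stale = latest_batch_id(store)  # listing has not caught up: sees batch 2
store.settle()
fresh = latest_batch_id(store)  # listing has caught up: sees batch 3

print(stale, fresh)
```

In this toy run a job restarting during the lag window would resume from batch 2 even though batch 3 was already written. As I read it, the managed offerings mentioned above exist to engineer around problems of this kind.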
