9

From the spark structured streaming documentation: "This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when starting a query."

And sure enough, setting the checkpoint to an S3 path throws:

17/01/31 21:23:56 ERROR ApplicationMaster: User class threw exception: java.lang.IllegalArgumentException: Wrong FS: s3://xxxx/fact_checkpoints/metadata, expected: hdfs://xxxx:8020 
java.lang.IllegalArgumentException: Wrong FS: s3://xxxx/fact_checkpoints/metadata, expected: hdfs://xxxx:8020 
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:652) 
        at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194) 
        at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106) 
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305) 
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301) 
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) 
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1301) 
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1430) 
        at org.apache.spark.sql.execution.streaming.StreamMetadata$.read(StreamMetadata.scala:51) 
        at org.apache.spark.sql.execution.streaming.StreamExecution.<init>(StreamExecution.scala:100) 
        at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:232) 
        at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269) 
        at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262) 
        at com.roku.dea.spark.streaming.FactDeviceLogsProcessor$.main(FactDeviceLogsProcessor.scala:133) 
        at com.roku.dea.spark.streaming.FactDeviceLogsProcessor.main(FactDeviceLogsProcessor.scala) 
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
        at java.lang.reflect.Method.invoke(Method.java:498) 
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637) 
17/01/31 21:23:56 INFO SparkContext: Invoking stop() from shutdown hook 

A couple of questions here:

  1. Why is s3 not supported as a checkpoint dir (regular Spark Streaming supports this)? What makes a filesystem "HDFS compliant"?
  2. I use HDFS ephemerally (since clusters can come up and go down all the time) and use s3 as the place to persist all data - what would be the recommendation for storing Structured Streaming checkpoint data in such a setup?
zero323
Apurva

5 Answers

10

What makes an FS HDFS "compliant"? It is a file system with the behaviours specified in the Hadoop FS specification. The difference between an object store and an FS is covered there, with the key point being that "eventually consistent object stores without append or O(1) atomic renames are not compliant".

For S3 in particular:

  1. It's not consistent: after a new blob is created, a list command often doesn't show it. Same for deletions.
  2. When a blob is overwritten or deleted, it can take a while to go away
  3. rename() is implemented by copy and then delete

Spark streaming checkpoints by saving everything to a location and then renaming it to the checkpoint directory. This makes the time to checkpoint proportional to the time to do a copy of the data in S3, which is ~6-10 MB/s.
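
To make that cost concrete, here is a rough sketch of the write-then-rename commit pattern using the Hadoop FileSystem API (paths and file names are hypothetical; this is not Spark's actual checkpoint code). On HDFS the final rename is an O(1) metadata operation; on S3 it becomes a copy of every byte followed by a delete:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: write the checkpoint data to a temporary file, then rename it
// into place so readers never see a half-written file.
val conf = new Configuration()
val checkpointDir = new Path("s3://bucket/checkpoints")   // hypothetical
val fs: FileSystem = checkpointDir.getFileSystem(conf)

val tmp = new Path(checkpointDir, ".tmp-" + System.currentTimeMillis())
val out = fs.create(tmp)
out.write("serialized checkpoint state".getBytes("UTF-8"))
out.close()

// Atomic metadata update on HDFS; on S3 this is a copy plus a delete, so the
// cost grows with the amount of checkpoint data.
fs.rename(tmp, new Path(checkpointDir, "commit"))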

The current bit of streaming code isn't suited for S3.

For now, do one of

  • checkpoint to HDFS and then copy over the results (see the sketch after this list)
  • checkpoint to a bit of EBS allocated and attached to your cluster
  • checkpoint to S3, but have a long gap between checkpoints so that the time to checkpoint doesn't bring your streaming app down.
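
A minimal sketch of the first option (bucket and paths hypothetical, df an existing streaming DataFrame): keep the checkpoint on HDFS and write only the query output to S3; if the checkpoint itself must outlive the cluster, copy it out of band.

// Checkpoint on HDFS, output on S3 (names hypothetical).
val query = df.writeStream
    .format("parquet")
    .option("path", "s3a://bucket/output/")
    .option("checkpointLocation", "hdfs:///checkpoints/my-query")
    .start()

// To persist the checkpoint off-cluster, copy it periodically, e.g.:
//   hadoop distcp hdfs:///checkpoints/my-query s3a://bucket/checkpoint-backup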

If you are using EMR, you can pay the premium for DynamoDB-backed S3 (the EMRFS consistent view), which gives you better consistency. But the copy time is still the same, so checkpointing will be just as slow.

stevel
  • We have a 40-second interval between checkpoints to S3 and we still occasionally have checkpointing problems, such as a temp directory being written to and then not found. – Yuval Itzchakov Feb 05 '17 at 21:12
  • The checkpoint not being found is probably S3's consistency surfacing: listings tend to lag changes in the object store. Normally you don't notice, but sometimes it surfaces. Using Dynamo for the metadata store should work...at least if it doesn't, it's been implemented wrongly – stevel Feb 06 '17 at 16:04
8

This is a known issue: https://issues.apache.org/jira/browse/SPARK-19407

Should be fixed in the next release. You can set the default file system to s3 using --conf spark.hadoop.fs.defaultFS=s3 as a workaround.
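
The same workaround can also be set when building the SparkSession (a sketch; the bucket URI is an assumption, and spark.hadoop.* properties get copied into the Hadoop configuration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
    .appName("streaming-s3-checkpoint")
    // Same effect as passing --conf spark.hadoop.fs.defaultFS=s3 to spark-submit;
    // a full bucket URI is shown here as an assumption.
    .config("spark.hadoop.fs.defaultFS", "s3://my-bucket")
    .getOrCreate()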

zsxwing
    Don't think this is resolved yet. Still unable to checkpoint structured streaming on S3 (spark 2.1.1) . The checkpoint recovery fails with: 7/06/29 00:29:00 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint org.apache.spark.sql.AnalysisException: This query does not support recovering from checkpoint location. – Apurva Jun 29 '17 at 01:03
  • That's a different issue. Are you using "memory" or "console" which doesn't support recovery? – zsxwing Jun 29 '17 at 07:40
  • I am trying to use Spark on YARN in client mode, and I have the same problem – Grigoriev Nick Nov 06 '19 at 09:32
5

This problem is fixed in https://issues.apache.org/jira/browse/SPARK-19407.

However, Structured Streaming checkpointing doesn't work well on S3 because S3 is only eventually consistent. It's not a good idea to use S3 for checkpointing; see https://issues.apache.org/jira/browse/SPARK-19013.

Michael Armbrust has said that this won't be fixed in Spark, and the solution is to wait for S3Guard to be implemented. S3Guard is some time away.
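
For reference, once S3Guard is available in your Hadoop build, enabling it is mostly S3A configuration, roughly along these lines (a sketch; spark is your SparkSession, the table name and region are hypothetical, and the keys should be checked against your Hadoop version):

val hc = spark.sparkContext.hadoopConfiguration
// Keep S3 listing metadata in DynamoDB so directory listings are consistent.
hc.set("fs.s3a.metadatastore.impl",
       "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore")
hc.set("fs.s3a.s3guard.ddb.table", "my-s3guard-table")   // hypothetical
hc.set("fs.s3a.s3guard.ddb.region", "us-east-1")         // hypothetical
hc.set("fs.s3a.s3guard.ddb.table.create", "true")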

Edit: two developments since this post was made: a) support for S3Guard was merged in Spark 3.0; b) AWS made S3 immediately consistent.

Jayesh Lalwani
1

Yes, if you are using Spark Structured Streaming on Spark 3 or above. First, create a SparkSession and add the S3 configs to its Hadoop configuration.

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession
    .builder()
    .master(sparkMasterUrl)
    .appName(appName)
    .getOrCreate()

// S3A credentials, endpoint and filesystem implementation. Keys set directly
// on the Hadoop configuration are not prefixed with "spark.hadoop."
val hadoopConf = sparkSession.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", "accessKey")
hadoopConf.set("fs.s3a.secret.key", "secretKey")
hadoopConf.set("fs.s3a.endpoint", "http://s3URL:s3Port")
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

Then, set the checkpointLocation option to a path in the S3 bucket before you start the query. For example:

import org.apache.spark.sql.DataFrame

// streamingDF is the streaming DataFrame built from sparkSession.readStream
val streamingQuery = streamingDF.writeStream
    .option("checkpointLocation", "s3a://bucketName/checkpointDir/")
    .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
       // Transform and write batchDF
     }
    .start()

streamingQuery.awaitTermination()
Pardeep
0

You can use S3 for checkpointing, but you should enable the EMRFS consistent view so that S3 consistency is handled.

raj singh