
For checkpointing purposes I am trying to set up an Amazon S3 bucket as the checkpoint location.

import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "s3a://bucket-name/checkpoint.txt"
val sc = new SparkContext(conf) // conf is a SparkConf defined earlier
sc.setLocalProperty("spark.default.parallelism", "30")
sc.hadoopConfiguration.set("fs.s3a.access.key", "xxxxx")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "xxxxx")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "bucket-name.s3-website.eu-central-1.amazonaws.com")
val ssc = new StreamingContext(sc, Seconds(10))
ssc.checkpoint(checkpointDir)

but it fails with this exception:

Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: 9D8E8002H3BBDDC7, AWS Error Code: null, AWS Error Message: Bad Request, S3 Extended Request ID: Qme5E3KAr/KX0djiq9poGXPJkmr0vuXAduZujwGlvaAl+oc6vlUpq7LIh70IF3LNgoewjP+HnXA=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:154)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.spark.streaming.StreamingContext.checkpoint(StreamingContext.scala:232)
at com.misterbell.shiva.StreamingApp$.main(StreamingApp.scala:89)
at com.misterbell.shiva.StreamingApp.main(StreamingApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I don't understand why I'm getting this error, and I can't find any example of this setup.


3 Answers


This message corresponds to something like a "bad endpoint" or an unsupported signature version.

As seen here, Frankfurt is the only region that does not support Signature Version 2, and of course it's the one I had picked.

Even after all my research I can't really say what a signature version is; it's not obvious in the documentation. But V2 seems to work with s3a.

The endpoint shown in the S3 console is not the real API endpoint; it's just the static-website endpoint.

You have to use one of the regional API endpoints, like this:

sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.eu-west-1.amazonaws.com")

It does work by default with the US endpoint, though.
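
A minimal sketch of the corrected setup, assuming the bucket lives in eu-west-1 (bucket name and keys are placeholders); the key change is pointing fs.s3a.endpoint at the regional API endpoint rather than the website endpoint:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("checkpoint-to-s3")
val sc = new SparkContext(conf)
sc.hadoopConfiguration.set("fs.s3a.access.key", "xxxxx")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "xxxxx")
// regional API endpoint (s3.<region>.amazonaws.com), not the s3-website endpoint
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.eu-west-1.amazonaws.com")

val ssc = new StreamingContext(sc, Seconds(10))
ssc.checkpoint("s3a://bucket-name/checkpoint/")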

– crak
  • I can confirm that the different AWS regions use different signature versions (e.g. SHA-256 based), so one should try to use the most recent compatible versions, e.g. of aws-java-sdk – PlagTag Apr 12 '18 at 14:12
  • The equivalent of the above command in PySpark is `sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.eu-west-1.amazonaws.com")` – P D Jul 15 '21 at 07:23

If you'd like to use a region that requires Signature V4 in Spark anyway, you can pass the flag -Dcom.amazonaws.services.s3.enableV4 to the driver and executor options at runtime. For example:

spark-submit --conf spark.driver.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' \
    --conf spark.executor.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' \
    ... (other spark options)

With these settings, Spark is able to write to Frankfurt (and other V4-only regions) even with a not-so-fresh AWS SDK version (com.amazonaws:aws-java-sdk:1.7.4 in my case).
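
If you don't want to repeat the flags on every submit, the same keys should also work from conf/spark-defaults.conf (a sketch, not tied to a specific Spark version):

spark.driver.extraJavaOptions    -Dcom.amazonaws.services.s3.enableV4
spark.executor.extraJavaOptions  -Dcom.amazonaws.services.s3.enableV4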

– Mariusz
  • Which Spark version? 2.4? – c74ckds Feb 13 '20 at 20:25
  • Yes, I used it with 2.4.3 – Mariusz Feb 13 '20 at 20:37
  • Just tried it myself also, but had some other issues and was wondering if you tried it with 2.4. Now it works for me also. Thanks! – c74ckds Feb 13 '20 at 20:50
  • Trying with Spark 2.4.5, I had to do both this and @crak's reply to get Spark to read the file for region ap-south-1. Either one of them individually was not enough. Was using `com.amazonaws:aws-java-sdk:1.7.4`. – Samik R Apr 09 '20 at 18:13
  • This saved me `%spark.conf spark.jars /spark-additional-jars/spark-avro_2.11-2.4.3.jar,/spark-additional-jars/hudi-spark-bundle_2.11-0.6.0.jar,/spark-additional-jars/hudi-utilities-bundle_2.11-0.6.0.jar,/spark-additional-jars/hadoop-aws-2.7.3.jar,/spark-additional-jars/hadoop-common-2.7.3.jar,/spark-additional-jars/aws-java-sdk-1.7.4.jar spark.serializer org.apache.spark.serializer.KryoSerializer spark.executor.extraJavaOptions -Dcom.amazonaws.services.s3.enableV4` extraJavaOptions -our saviour – user2455668 Nov 19 '20 at 19:30

I was facing the same issue when running Spark locally; in my case the reason was that SigV4 was not getting set. This code helped me:

import com.amazonaws.SDKGlobalConfiguration
System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true")
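
A minimal sketch of where that call would sit in a local driver program (the property has to be set before the S3A filesystem is first created; the region and bucket below are placeholders). When running on a cluster rather than locally, the executor JVMs would also need the property, e.g. via the extraJavaOptions flags from the previous answer:

import com.amazonaws.SDKGlobalConfiguration
import org.apache.spark.{SparkConf, SparkContext}

// enable SigV4 before any S3A filesystem is initialized
System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true")

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("sigv4-local"))
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")

val lines = sc.textFile("s3a://bucket-name/some-prefix/")
println(lines.count())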
– AbhiK
  • I'm also facing the same issue when hitting an S3 bucket to upload a file from an Android application. I am using the Retrofit library. – sudhanshu Sep 01 '21 at 05:21