
I am facing the following error while writing to an S3 bucket using PySpark.

com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: A0B0C0000000DEF0, AWS Error Code: InvalidArgument, AWS Error Message: Requests specifying Server Side Encryption with AWS KMS managed keys require AWS Signature Version 4.,

I have applied server-side encryption on the S3 bucket using the AWS KMS service. I am using the following spark-submit command -

spark-submit --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 --jars sample-jar sample_pyspark.py 

This is the sample code I am working on -

from pyspark import SparkContext
from pyspark.sql import SQLContext, SparkSession

spark_context = SparkContext()
sql_context = SQLContext(spark_context)
spark = SparkSession.builder.appName('abc').getOrCreate()

# Point the s3a:// scheme at the S3A filesystem implementation
hadoopConf = spark_context._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

# 'source_data' is an existing Spark dataframe
source_data.coalesce(1).write.mode('overwrite').parquet("s3a://sample-bucket")

Note: Writing the same dataframe to an S3 bucket without server-side encryption enabled was successful, so the problem only appears with SSE-KMS.

  • You are using a five-year-old copy of the s3a connector. Try using a release of Spark with the hadoop-3.1 binaries and see what happens there. – stevel Oct 19 '20 at 10:46
  • Solution: https://stackoverflow.com/a/56855992/1465609 – wind Apr 20 '21 at 06:34

1 Answer


The error seems to be telling you to enable V4 S3 signatures on the Amazon SDK. One way to do it is from the command line:

spark-submit --conf spark.driver.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' \
    --conf spark.executor.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' \
    ... (other spark options)
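
Alongside those JVM flags, V4 signing on this generation of the AWS SDK also needs a region-specific S3 endpoint (this is what the solution linked in the comments points out). A minimal sketch of setting it on the same hadoopConfiguration your code already uses; the us-east-2 endpoint is a placeholder for your bucket's actual region:

# Sketch, to pair with the -Dcom.amazonaws.services.s3.enableV4 flags above.
# V4-signed requests must target a region-specific endpoint rather than the
# global s3.amazonaws.com one. Replace us-east-2 with your bucket's region.
hadoopConf = spark.sparkContext._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")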

That said, I agree with Steve's comment above that you should move to a more recent Hadoop library.
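
For example, with a Spark distribution built against Hadoop 3.1, the packages line could look like the following (a sketch; the hadoop-aws version is a placeholder and must match the Hadoop version of your Spark build, and hadoop-aws 3.x pulls in its matching aws-java-sdk-bundle on its own, so the separate SDK coordinate can be dropped):

spark-submit --packages org.apache.hadoop:hadoop-aws:3.1.2 sample_pyspark.py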

