
I'm trying to read a CSV file from an AWS S3 bucket with Spark, currently doing it through a Jupyter notebook.

After setting up the AWS S3 configuration for Spark, I am getting this error when trying to read the CSV:

Py4JJavaError: An error occurred while calling SOMERANDOMNAME.csv.
: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: XXXXXXXXXX, AWS Error Code: null, AWS Error Message: Bad Request, S3 Extended Request ID: XXXXXXXXXXX

The way I am setting up the configuration:

hadoopConf = spark.sparkContext._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.endpoint", s3_endpoint_url)
hadoopConf.set("fs.s3a.access.key", s3_access_key_id)
hadoopConf.set("fs.s3a.secret.key", s3_secret_access_key)
hadoopConf.set("fs.s3a.path.style.access", "true")

The way I'm trying to read the CSV:

data = spark.read.csv('s3a://' + s3_bucket + '/data.csv', sep=",", header=True)

Running that block gives me the error above. Could you help me figure out what is going wrong?

Thank you in advance!

Cesar Flores
  • Does this answer your question? [Amazon s3a returns 400 Bad Request with Spark](https://stackoverflow.com/questions/34209196/amazon-s3a-returns-400-bad-request-with-spark) – mck Feb 09 '21 at 16:32
  • Not really; I tried what is written there and am still getting the error: ```Py4JJavaError: An error occurred while calling o894.csv. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 15.0 failed 4 times, most recent failure: Lost task 0.3 in stage 15.0 (TID 73, 11.111.1.11, executor 0): com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID:, AWS Error Code: null, AWS Error Message: Bad Request, S3 Extended Request ID:``` – Cesar Flores Feb 10 '21 at 11:38

1 Answer


OK, I was able to make it work after all, so I'm answering my own question.

I needed to first update the packages passed to spark-submit at runtime. I was using org.apache.hadoop:hadoop-aws:2.7.3 and changed it to org.apache.hadoop:hadoop-aws:2.7.7. Secondly, I passed these configurations to the Spark executor and driver to enable the V4 signature:

--conf spark.hadoop.fs.s3a.endpoint=s3.us-east-2.amazonaws.com
--conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
--conf spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true

The spark-submit arguments then looked like this (when run in a notebook):

os.environ['PYSPARK_SUBMIT_ARGS'] = (
    f"--conf spark.jars.ivy={os.environ['HOME']} "
    "--packages org.apache.hadoop:hadoop-aws:2.7.7,com.amazonaws:aws-java-sdk:1.7.4 "
    "--conf spark.hadoop.fs.s3a.endpoint=s3.us-east-2.amazonaws.com "
    "--conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true "
    "--conf spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true "
    "pyspark-shell"
)
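
One detail worth calling out (my understanding of how PySpark works, not something the error tells you): PYSPARK_SUBMIT_ARGS is only read when the JVM is launched, so the environment variable has to be set before the SparkSession is created. A minimal sketch of the ordering, with a placeholder app name:

import os
from pyspark.sql import SparkSession

# Set the submit args first; changing them after the JVM is up has no effect.
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages org.apache.hadoop:hadoop-aws:2.7.7,com.amazonaws:aws-java-sdk:1.7.4 pyspark-shell"

# Only now create the session, so the packages above are actually loaded.
spark = SparkSession.builder.appName("s3a-csv-read").getOrCreate()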

Then, at runtime, I defined the following configurations:

hadoopConf = spark.sparkContext._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.endpoint", s3_endpoint_url)
hadoopConf.set("fs.s3a.access.key", s3_access_key_id)
hadoopConf.set("fs.s3a.secret.key", s3_secret_access_key)
hadoopConf.set("fs.s3a.path.style.access", "true")
hadoopConf.set("fs.s3a.connection.ssl.enabled", "true")

Finally, when reading the file I did this:

data = spark.read.csv('s3a://' + s3_bucket + '/data.csv', sep=",", header=True)
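
A quick sanity check after the read (nothing S3-specific, just confirming the DataFrame is usable):

# With header=True the column names come from the first row; types stay as strings unless inferSchema is enabled.
data.printSchema()
data.show(5)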

I realized that this only happened to me when reading from a bucket in the us-east-2 region; doing the same in us-east-1 with the configuration from my question worked fine. In summary, the key was enabling the V4 signature.
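
As far as I know, us-east-2 is one of the newer regions that only accepts Signature Version 4 requests, while us-east-1 also accepts older signatures, which would explain the difference. If you need to target other regions, a small hypothetical helper for the region-specific endpoint could look like this; only the us-east-2 value above is the one I actually verified:

# Hypothetical helper: the s3.<region>.amazonaws.com pattern is an assumption;
# only s3.us-east-2.amazonaws.com was verified above.
def s3_endpoint(region):
    return f"s3.{region}.amazonaws.com"

hadoopConf.set("fs.s3a.endpoint", s3_endpoint("us-east-2"))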

Cesar Flores