I'm trying to read a CSV file from an AWS S3 bucket with Spark, currently through a Jupyter notebook.
After setting up the AWS S3 configuration for Spark, I get this error when trying to read the CSV:
Py4JJavaError: An error occurred while calling SOMERANDOMNAME.csv.
: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: XXXXXXXXXX, AWS Error Code: null, AWS Error Message: Bad Request, S3 Extended Request ID: XXXXXXXXXXX
The way I am setting up the configuration:
hadoopConf = spark.sparkContext._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.endpoint", s3_endpoint_url)
hadoopConf.set("fs.s3a.access.key", s3_access_key_id)
hadoopConf.set("fs.s3a.secret.key", s3_secret_access_key)
hadoopConf.set("fs.s3a.path.style.access", "true")
The way I'm trying to read the CSV:
data = spark.read.csv("s3a://" + s3_bucket + "/data.csv", sep=",", header=True)
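For completeness, here is the full minimal snippet I'm running, as one cell. The endpoint, bucket name, and credentials are placeholders standing in for values I load from my environment:

```python
from pyspark.sql import SparkSession

# Placeholders -- the real values come from my environment
s3_endpoint_url = "https://s3.eu-west-1.amazonaws.com"  # example endpoint, not my real one
s3_access_key_id = "..."
s3_secret_access_key = "..."
s3_bucket = "my-bucket"

spark = SparkSession.builder.appName("s3-csv-read").getOrCreate()

# S3A configuration, set on the underlying Hadoop configuration
hadoopConf = spark.sparkContext._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.endpoint", s3_endpoint_url)
hadoopConf.set("fs.s3a.access.key", s3_access_key_id)
hadoopConf.set("fs.s3a.secret.key", s3_secret_access_key)
hadoopConf.set("fs.s3a.path.style.access", "true")

# This is the line that raises the Py4JJavaError above
data = spark.read.csv("s3a://" + s3_bucket + "/data.csv", sep=",", header=True)
```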
Running that cell raises the error above. Any idea what could be going wrong?
Thank you in advance!