I have been getting intermittent failures when trying to read from an S3 bucket from Databricks on Azure. It can go months without working, suddenly work for a while, and then stop again.

The Scala code is as follows:

val access_key = "XXXXXXXXX"
val secret_key = "XXXXXXXXX"
val encoded_secret_key = secret_key.replace("/", "%2F")
val aws_bucket_name = "bucket-name"
val file_path = "filePath"

spark.conf.set("fs.s3n.awsAccessKeyId", access_key)
spark.conf.set("fs.s3n.awsSecretAccessKey", encoded_secret_key)

var df = dbutils.fs.ls(s"""s3a://$aws_bucket_name/$file_path""")

display(df)

Sometimes it works and other times it doesn't, all without any configuration changes, at least not on the code or cluster configuration side. When it fails, the error is as follows:

java.nio.file.AccessDeniedException: s3a:///: getFileStatus on s3a:///: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden; request: HEAD https://..amazonaws.com {} Hadoop 2.7.4, aws-sdk-java/1.11.655 Linux/5.4.0-1063-azure OpenJDK_64-Bit_Server_VM/25.282-b08 java/1.8.0_282 scala/2.12.10 vendor/Azul_Systems,_Inc. com.amazonaws.services.s3.model.GetObjectMetadataRequest; Request ID: , Extended Request ID: <long/id>, Cloud Provider: Azure, Instance ID: (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: ; S3 Extended Request ID: ), S3 Extended Request ID: :403 Forbidden

I'm not even sure how to troubleshoot this. The connection works fine with Python (boto3) in the same notebook, but the Scala code doesn't.

We are using Spark 3.0.1 and Scala 2.12.

Ermiya Eskandary
ewong18
  • What are the IAM roles and permissions assigned to the user whose access key and secret key you're using? Exact and complete roles and/or inline policies, please. – Ermiya Eskandary Jan 13 '22 at 20:49
  • Why are you setting the fs.s3n options? Does the Databricks documentation recommend this, or are you just copying from an SO post of ten years ago? – stevel Jan 17 '22 at 12:37
  • @stevel This was just the code we had that was working before. The documentation I've read only has Python code, not Scala. What should we be using instead? – ewong18 Jan 18 '22 at 00:27
  • Use the s3a docs, not superstition passed down one incorrect SO post at a time: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html – stevel Jan 18 '22 at 17:04
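
For reference, the s3a documentation linked in the last comment names the credential properties fs.s3a.access.key and fs.s3a.secret.key, whereas the code in the question sets fs.s3n.* properties while reading an s3a:// path. Below is a minimal, untested sketch of the question's snippet rewritten with the s3a property names. It assumes a Databricks Scala notebook where spark, dbutils, and display are predefined, and it assumes that setting the values on the Hadoop configuration is sufficient for dbutils.fs.ls to pick them up. Whether this resolves the intermittent 403 is not confirmed here.

```scala
// Placeholders carried over from the question; these are not real values.
val accessKey = "XXXXXXXXX"
val secretKey = "XXXXXXXXX"
val awsBucketName = "bucket-name"
val filePath = "filePath"

// Assumption: credentials go on the Hadoop configuration under the s3a
// property names, so they match the s3a:// scheme used in the path below.
// The secret is passed unmodified (no URL-encoding of "/").
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", accessKey)
hadoopConf.set("fs.s3a.secret.key", secretKey)

// Same listing call as in the question, now using the s3a credentials.
val files = dbutils.fs.ls(s"s3a://$awsBucketName/$filePath")
display(files)
```

The replace("/", "%2F") step from the question is dropped in this sketch on the assumption that it was only needed when a secret is embedded in the URI itself, which is not the case here.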

0 Answers