I am trying to read multiple XML files from an Azure Blob Storage container using PySpark. When I run the script in an Azure Synapse notebook, I get the error below.
Note:
- I have tested the connection using the Azure Data Lake Storage Gen2 linked service (both the test to the linked service itself and the test to the file path)
- I have added my workspace under 'Role Assignments' and given it the 'Storage Blob Data Contributor' role
PySpark code:
The error is thrown at the line below:
df = spark.read.format("xml").options(rowTag="x", inferSchema=True).load(xmlfile.path)
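For context, the files are enumerated with mssparkutils and read in a loop. A minimal sketch of the surrounding code (the abfss path below is a placeholder; the real storage account, container, and folder names are redacted):

from notebookutils import mssparkutils  # Synapse notebook file utilities

# Placeholder path; the real storage account/container/folder are redacted
xml_dir = "abfss://<container>@<account>.dfs.core.windows.net/<folder>"

for xmlfile in mssparkutils.fs.ls(xml_dir):
    if xmlfile.name.endswith(".xml"):
        # This load() is where the 403 AccessDeniedException below is raised
        df = (
            spark.read.format("xml")
            .options(rowTag="x", inferSchema=True)
            .load(xmlfile.path)
        )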
Assumption:
My assumption is that I don't have read permission on the XML files, but I am not sure whether I am missing something else. Can you please shed some light?
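In case it is relevant: my understanding from the Synapse TokenLibrary documentation is that the session can also be pointed at a linked service explicitly, along the lines of the sketch below (the linked service name is a placeholder; I have not confirmed whether this changes anything in my case):

# Sketch based on the Synapse TokenLibrary docs; "<MyLinkedService>" is a placeholder
spark.conf.set("spark.storage.synapse.linkedServiceName", "<MyLinkedService>")
spark.conf.set(
    "fs.azure.account.oauth.provider.type",
    "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedTokenProvider",
)

Full stack trace: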
Py4JJavaError: An error occurred while calling o2333.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 3.0 failed 4 times, most recent failure: Lost task 26.3 in stage 3.0 (TID 288) : java.nio.file.AccessDeniedException: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, HEAD, https://<*path/x.xml*>
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:1185)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.open(AzureBlobFileSystem.java:200)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.open(AzureBlobFileSystem.java:187)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:912)
at com.databricks.spark.xml.XmlRecordReader.initialize(XmlInputFormat.scala:86)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:240)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:237)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:192)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:91)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:57)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:57)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, HEAD, <path/x.xml>
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:207)
at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getPathStatus(AbfsClient.java:570)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.openFileForRead(AzureBlobFileSystemStore.java:627)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.open(AzureBlobFileSystem.java:196)
... 23 more