I am trying to read multiple XML files from an Azure Blob Storage container using PySpark. When I run the script in an Azure Synapse notebook, I get the error below.
Note:
- I have tested the connection using the Azure Data Lake Storage Gen2 linked service (both the test to the linked service itself and the test to the file path)
- I have added my workspace under 'Role Assignments' and given it the 'Storage Blob Data Contributor' role
PySpark code:
The error is thrown at the line below:
df = spark.read.format("xml").options(rowTag="x", inferSchema=True).load(xmlfile.path)
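For context, the files are enumerated with mssparkutils and read in a loop. A minimal sketch of the surrounding code (the abfss path below is a placeholder; the real storage account, container, and folder names are redacted):

from notebookutils import mssparkutils  # Synapse notebook file utilities

# Placeholder path; the real storage account/container/folder are redacted
xml_dir = "abfss://<container>@<account>.dfs.core.windows.net/<folder>"

for xmlfile in mssparkutils.fs.ls(xml_dir):
    if xmlfile.name.endswith(".xml"):
        # This load() is where the 403 AccessDeniedException below is raised
        df = (
            spark.read.format("xml")
            .options(rowTag="x", inferSchema=True)
            .load(xmlfile.path)
        )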
Assumption:
My assumption is that I don't have read permission on the XML files, but I am not sure whether I am missing something else. Can you please shed some light?
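In case it is relevant: my understanding from the Synapse TokenLibrary documentation is that the session can also be pointed at a linked service explicitly, along the lines of the sketch below (the linked service name is a placeholder; I have not confirmed whether this changes anything in my case):

# Sketch based on the Synapse TokenLibrary docs; "<MyLinkedService>" is a placeholder
spark.conf.set("spark.storage.synapse.linkedServiceName", "<MyLinkedService>")
spark.conf.set(
    "fs.azure.account.oauth.provider.type",
    "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedTokenProvider",
)

Full stack trace: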
Py4JJavaError: An error occurred while calling o2333.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 3.0 failed 4 times, most recent failure: Lost task 26.3 in stage 3.0 (TID 288) : java.nio.file.AccessDeniedException: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, HEAD, https://<*path/x.xml*>
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:1185)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.open(AzureBlobFileSystem.java:200)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.open(AzureBlobFileSystem.java:187)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:912)
at com.databricks.spark.xml.XmlRecordReader.initialize(XmlInputFormat.scala:86)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:240)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:237)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:192)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:91)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:57)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:57)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, HEAD, <path/x.xml>
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:207)
at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getPathStatus(AbfsClient.java:570)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.openFileForRead(AzureBlobFileSystemStore.java:627)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.open(AzureBlobFileSystem.java:196)
... 23 more