
I am trying to load data from an Azure storage container into a PySpark DataFrame in Azure Databricks. Reading .txt and .csv files works fine, but when I try to read .xlsx files I get the issue below.

Apache Spark 3.2.0, Scala 2.12

Below are the steps I am performing:

spark.conf.set("fs.azure.account.key.teststorage.blob.core.windows.net",
"**********************")

This works:

df = spark.read.format("csv").option("header", "true") \
  .option("inferSchema", "true") \
  .load("wasbs://testcontainer@teststorage.blob.core.windows.net/data/samplefile.txt")

This does not work:

df = spark.read.format("com.crealytics.spark.excel") \
  .option("header", "true").option("inferSchema","true") \
  .load("wasbs://testcontainer@teststorage.blob.core.windows.net/data/samplefile.xlsx")

I get the following error while loading the .xlsx file:

: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: Container producer in account teststorage.blob.core.windows.net not found, and we can't create it using anoynomous credentials, and no credentials found for them in the configuration.
    at shaded.databricks.org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:1063)
    at shaded.databricks.org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:512)
    at shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1384)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:537)
    at com.crealytics.spark.excel.WorkbookReader$.readFromHadoop$1(WorkbookReader.scala:35)
    at com.crealytics.spark.excel.WorkbookReader$.$anonfun$apply$2(WorkbookReader.scala:41)
    at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$1(WorkbookReader.scala:49)
    at scala.Option.fold(Option.scala:251)
    at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:49)
    at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:14)
    at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:13)
    at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:45)
    at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:31)
    at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:31)
    at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:102)
    at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:101)
    at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:163)
    at scala.Option.getOrElse(Option.scala:189)
    at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:162)
    at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:35)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:35)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:13)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:355)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:322)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:322)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:235)
    at sun.reflect.GeneratedMethodAccessor338.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
    at py4j.Gateway.invoke(Gateway.java:295)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:251)
    at java.lang.Thread.run(Thread.java:748)

Note: I am able to read the same file from DBFS and from a mount point.

Ramineni Ravi Teja

2 Answers


This problem arises from the public access level of the blob storage container. When the container's public access level is Private or Blob, the same error occurs for Excel files; when it is set to Container, you can read the Excel files without error. This is what I found while reproducing the issue.


A simple solution is either to change the container's public access level to Container, or to mount the blob storage account to the Databricks file system (which already works for you; see the mount sketch after the steps below). If you choose to change the access level, go to the container in your blob storage account, where you will find the "Change access level" option and can select Container.

  • Navigate to the container in the storage account and change its access level.


  • Go back to Databricks and run the DataFrame read again; it now works without any error.
df2 = spark.read.format("com.crealytics.spark.excel") \
      .option("header", "true").option("inferSchema","true") \
      .load("wasbs://<container>@<storage_acc>.blob.core.windows.net/data.xlsx")
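For the mount alternative mentioned above, here is a minimal sketch; the mount point name and the placeholder container, account, and key values are illustrative, not from the original post:

# One-time mount; the container then stays readable through the mount path
# <container>, <storage_acc>, and <account-key> are placeholders - substitute your own
dbutils.fs.mount(
    source="wasbs://<container>@<storage_acc>.blob.core.windows.net",
    mount_point="/mnt/exceldata",
    extra_configs={"fs.azure.account.key.<storage_acc>.blob.core.windows.net": "<account-key>"})

# Read the workbook through the mount instead of a wasbs:// URL
df2 = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true").option("inferSchema", "true") \
    .load("/mnt/exceldata/data.xlsx")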


Please refer to the following document to learn more about accessing a blob storage account from Databricks:

https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/azure/azure-storage

Saideep Arikontham
  • This is a very incorrect answer. I have a blob storage account with private access and I am still able to read Excel files using a `wasbs` path and a Spark config setting in the cluster's advanced properties (e.g. `spark.hadoop.fs.azure.account.key..blob.core.windows.net`) – A5H1Q Jan 28 '23 at 18:51
  • See https://stackoverflow.com/a/75270203/10532753 – A5H1Q Jan 28 '23 at 19:18
  • Hey @A5H1Q, the solution given is incomplete but not incorrect. Using `spark.conf.set("fs.azure.account.key.teststorage.blob.core.windows.net", "**********************")` in the notebook and then trying to read the Excel file still throws the same error (the one the user reported). The provided solution was only for that case. Your solution of configuring the storage account access key in the cluster's advanced properties works perfectly for any type of container. – Saideep Arikontham Jan 29 '23 at 06:26
  • Hey @Saideep, whatever works for the OP. But keep in mind that making a container publicly accessible is a greater security risk, especially when it allows anonymous users to enumerate the contents of the container. You are right that if your approach works, it is no longer incorrect, but the method itself is a security risk. – A5H1Q Jan 29 '23 at 07:14

This answer worked for me. (The missing piece was setting the account name/key at the cluster level, not the notebook level.)

Link
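For reference, a sketch of what that cluster-level setting looks like: in the cluster's Advanced options, Spark config section, add a line like the one below and restart the cluster. `teststorage` is the account name from the question; the key value is a placeholder.

spark.hadoop.fs.azure.account.key.teststorage.blob.core.windows.net <storage-account-access-key>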

smooth_smoothie