
I am using Koalas (the pandas API on Apache Spark) to write a DataFrame out to a mounted Azure Blob Storage container. When I call the df.to_csv API, Spark throws an exception and aborts the job.

Only a few of the stages seem to fail with the following error:

This request is not authorized to perform this operation using this
permission.

I am working with the data in Databricks on Azure using PySpark. The data products reside in a mounted Azure Blob Storage container. A service principal was created for Databricks and assigned the "Contributor" role on the Azure storage account.
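
For context, the mount itself was created roughly along the lines of the standard Databricks service-principal OAuth pattern sketched below; all names, IDs, and the secret scope are placeholders, since I cannot share the real values.

# Sketch of the mount setup (placeholder values, not the real configuration)
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="<scope>", key="<key>"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}
# Mount the storage container into DBFS
dbutils.fs.mount(
    source="abfss://<container>@<account>.dfs.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs,
)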

When looking into the storage account, I notice that the first few output blobs were already written to the target directory. Moreover, I am able to place output in the blob storage using a "pure Python" approach with pandas, so I doubt this is a general authorization issue between Databricks and the storage account.
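
For reference, the pandas route that does succeed looks roughly like this (same placeholder paths as in the example further below):

import pandas as pd

# Plain pandas runs only on the driver and goes through the /dbfs FUSE mount,
# so no Spark executors are involved in this write
pdf = pd.read_csv('/dbfs/spam/eggs.csv')
pdf.to_csv('/dbfs/bacon/eggs.csv', index=False)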

This is a minimal code example that reproduces the error.

# <Test to see if the blob storage is mounted>
# Import koalas
import databricks.koalas as ks
# Load the flatfile
df = ks.read_csv('/dbfs/spam/eggs.csv')
# Apply transformations
# Write out the dataframe
df.to_csv('/dbfs/bacon/eggs.csv')

Since there are many facets to this issue, I am uncertain where to start:

  • Authorization issue between the blob storage and Databricks

  • Incorrect setup of the Databricks cluster

  • Using the wrong API method

  • Issue with the file content

Any leads on where to look?
