I am using Koalas (the pandas API on Apache Spark) to write a dataframe out to mounted Azure Blob storage. When calling the df.to_csv API, Spark throws an exception and aborts the job.
Only a few of the stages seem to fail with the following error:
This request is not authorized to perform this operation using this
permission.
I am handling the data with Databricks on Azure using PySpark. The data products reside in mounted Azure Blob storage. A service principal was created for Databricks and assigned the "Contributor" role on the Azure storage account.
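For reference, the mount was set up roughly as follows, assuming the service principal authenticates via OAuth (the secret scope, key names, container, account, and tenant below are all placeholders, not my real values):

# Sketch of the mount setup; every name, scope, and key here is a placeholder
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get(scope="my-scope", key="sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="my-scope", key="sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}
dbutils.fs.mount(
    source="abfss://<container>@<account>.dfs.core.windows.net/",
    mount_point="/mnt/data",
    extra_configs=configs,
)
# Sanity check: listing the mount succeeds
display(dbutils.fs.ls("/mnt/data"))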
When looking into the storage account, I notice that the first few blobs were already written to the output directory. Moreover, I am able to place the output in the blob storage using a "pure Python" approach with pandas (see the sketch below). Therefore, I doubt that this is an authorization issue for Databricks itself.
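For comparison, the pandas version that does succeed looks roughly like this, going through the local /dbfs FUSE mount (this runs on the driver only, not as a distributed Spark job):

import pandas as pd

# Pure-Python write through the /dbfs FUSE mount; no Spark executors involved
pdf = pd.read_csv('/dbfs/spam/eggs.csv')
pdf.to_csv('/dbfs/bacon/eggs.csv', index=False)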
This is a minimal code example that reproduces the error.
# <Test to see if the blob storage is mounted>
# Import Koalas
import databricks.koalas as ks

# Load the flat file from the mounted storage
df = ks.read_csv('/dbfs/spam/eggs.csv')

# Apply transformations

# Write the dataframe back out as CSV
df.to_csv('/dbfs/bacon/eggs.csv')
Since there are many facets to this issue, I am uncertain where to start:
An authorization issue between the blob storage and Databricks
An incorrect setup of the Databricks cluster
Use of the wrong API method
An issue with the file content
Any leads on where to look?