I am using Koalas (the pandas API on Apache Spark) to write a dataframe out to mounted Azure Blob storage. When calling the df.to_csv API, Spark throws an exception and aborts the job.
Only a few of the stages seem to fail with the following error:
This request is not authorized to perform this operation using this
permission.
I am handling the data with Databricks on Azure using PySpark. The data products reside in mounted Azure Blob storage. A service principal was created for Databricks and assigned the "Contributor" role on the Azure storage account.
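For reference, the mount was set up roughly as follows, assuming the service principal authenticates via OAuth (the secret scope, key names, container, account, and tenant below are all placeholders, not my real values):

# Sketch of the mount setup; every name, scope, and key here is a placeholder
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get(scope="my-scope", key="sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="my-scope", key="sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}
dbutils.fs.mount(
    source="abfss://<container>@<account>.dfs.core.windows.net/",
    mount_point="/mnt/data",
    extra_configs=configs,
)
# Sanity check: listing the mount succeeds
display(dbutils.fs.ls("/mnt/data"))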
When looking into the storage account, I notice that the first few blobs were already written to the output directory. Moreover, I am able to place the output in the blob storage using a "pure Python" approach with pandas (see the sketch below). Therefore, I doubt that this is an authorization issue for Databricks itself.
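For comparison, the pandas version that does succeed looks roughly like this, going through the local /dbfs FUSE mount (this runs on the driver only, not as a distributed Spark job):

import pandas as pd

# Pure-Python write through the /dbfs FUSE mount; no Spark executors involved
pdf = pd.read_csv('/dbfs/spam/eggs.csv')
pdf.to_csv('/dbfs/bacon/eggs.csv', index=False)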
This is a minimal code example that reproduces the error.
# <Test to see if the blob storage is mounted>
# Import Koalas
import databricks.koalas as ks

# Load the flat file from the mounted storage
df = ks.read_csv('/dbfs/spam/eggs.csv')

# Apply transformations

# Write the dataframe back out as CSV
df.to_csv('/dbfs/bacon/eggs.csv')
Since there are many facets to this issue, I am uncertain where to start:
An authorization issue between the blob storage and Databricks
An incorrect setup of the Databricks cluster
Use of the wrong API method
An issue with the file content
Any leads on where to look?