
I am trying to write data to a CSV file and store it on Azure Data Lake Gen2, and I run into a "Job aborted" error. The same code used to work fine previously.

Error Message:

org.apache.spark.SparkException: Job aborted.   

Code:

import requests
from pyspark.sql import Row

# Pull the JSON records from the API and build a DataFrame from them
response = requests.get('https://myapiurl.com/v1/data', auth=('user', 'password'))
data = response.json()
df = spark.createDataFrame([Row(**i) for i in data])

# source is the output format (e.g. "csv") and path is the ADLS Gen2 location
df.write.format(source).mode("overwrite").save(path)  # error line
paone
  • Can you please share the code which you are trying to execute? – HimanshuSinha Aug 21 '20 at 02:03
  • Hi @HimanshuSinha-msft, Thank you for the response. Please find OP updated with the code. – paone Aug 21 '20 at 03:08
  • Could you please tell me how you access Azure Data Lake Gen2 in databricks? – Jim Xu Aug 21 '20 at 03:23
  • Hi @JimXu, Thanks for your response. I use the container wasbs://@.blob.core.windows.net/dir – paone Aug 21 '20 at 03:29
  • Have you added `spark.conf.set( "fs.azure.account.key..blob.core.windows.net", "")` to your code? – Jim Xu Aug 21 '20 at 03:31
  • @paone Besides, if you use ADLS Gen2, you need to use the `abfss` protocol to access files and add `spark.conf.set( "fs.azure.account.key..dfs.core.windows.net", "")` to your code to authenticate. For more details, please refer to https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/azure/azure-datalake-gen2. – Jim Xu Aug 21 '20 at 03:40
  • Hi @JimXu. Thanks for your response. I ran the storage account configuration now and it works as expected. Should this be run every time we restart the cluster? – paone Aug 21 '20 at 03:41
  • @paone the `wasbs` protocol is used for Azure blob storage. – Jim Xu Aug 21 '20 at 03:41
  • Yes, you should do that. If you only want to run it one time, you can mount Azure Blob Storage or ADLS Gen2 as the file system in Databricks; then you can access it like a local file system, e.g. `/mnt/..`. – Jim Xu Aug 21 '20 at 03:43
  • @JimXu Would you like to post that as an answer, so that OP can mark this question as answered? – CHEEKATLAPRADEEP Aug 21 '20 at 04:58
  • @paone I summarized my suggestions as a solution. Since it is useful for you, could you please [accept it as an answer](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work)? – Jim Xu Aug 24 '20 at 01:10

1 Answer


I summarize the solution below:

If you want to access Azure Data Lake Gen2 in Azure Databricks, you have two options.

  1. Mount Azure Data Lake Gen2 as the Azure Databricks file system. After doing that, you can read and write files with the path /mnt/<mount-name>/..., and we just need to run the mount code one time.

    a. Create a service principal and assign the Storage Blob Data Contributor role to it, scoped to the Data Lake Storage Gen2 storage account:

     az login
    
     az ad sp create-for-rbac -n "MyApp" --role "Storage Blob Data Contributor" \
         --scopes /subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>
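
    The create-for-rbac command prints the new service principal's appId, client secret (password), and tenant ID; those are the values that fill the <appId>, <clientSecret>, and <tenant> placeholders in the mount configuration below (exact output field names can vary slightly between Azure CLI versions). Optionally, instead of pasting the secret into the notebook, you can keep it in a Databricks secret scope; a minimal sketch, assuming a hypothetical scope name "adls-sp" and key "client-secret":

     # Hypothetical secret scope/key names -- replace with your own.
     # Hard-coding the secret in the configs below also works, it is just less safe.
     client_secret = dbutils.secrets.get(scope="adls-sp", key="client-secret")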
    

    b. Code to mount the container:

     # OAuth (service principal) settings for the ABFS driver
     configs = {"fs.azure.account.auth.type": "OAuth",
       "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
       "fs.azure.account.oauth2.client.id": "<appId>",
       "fs.azure.account.oauth2.client.secret": "<clientSecret>",
       "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant>/oauth2/token",
       "fs.azure.createRemoteFileSystemDuringInitialization": "true"}

     # Mount the container (or a folder in it) under /mnt/flightdata; run once
     dbutils.fs.mount(
        source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/folder1",
        mount_point = "/mnt/flightdata",
        extra_configs = configs)
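
     Once the mount exists, the failing write from the question can target the mount path directly. A minimal sketch (the result_csv folder name is just an example):

      # df is the DataFrame built from the API response in the question.
      # "result_csv" is a hypothetical output folder under the mounted container.
      df.write.format("csv").mode("overwrite").save("/mnt/flightdata/result_csv")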
    
  2. Access directly using the storage account access key.

    We can add the code spark.conf.set("fs.azure.account.key.<storage-account-name>.dfs.core.windows.net", "<storage-account-access-key>") to our script. Then we can read and write files with the path abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/.

    For example:

     from pyspark.sql.types import StringType

     # Authenticate to the storage account with its access key
     spark.conf.set(
       "fs.azure.account.key.testadls05.dfs.core.windows.net", "<account access key>")

     # Small test DataFrame written out as a single CSV file to the "test" container
     df = spark.createDataFrame(["10", "11", "13"], StringType()).toDF("age")
     df.show()
     df.coalesce(1).write.format('csv').option('header', True).mode('overwrite').save('abfss://test@testadls05.dfs.core.windows.net/result_csv')
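
     Applied to the question's code, the same pattern would look roughly like this; the storage account, container, and folder names below are placeholders, not values from the question:

      # Set the access key for the target storage account, then write the
      # API DataFrame from the question as CSV to an abfss:// path.
      spark.conf.set(
        "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
        "<storage-account-access-key>")
      df.write.format("csv").mode("overwrite").save(
        "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/output_csv")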
    


For more details, please refer to the Azure Databricks documentation for Azure Data Lake Storage Gen2: https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/azure/azure-datalake-gen2

Jim Xu