
I am trying to write data to a CSV file and store it on Azure Data Lake Gen2, and I run into a "Job aborted" error. The same code used to work fine previously.

Error Message:

org.apache.spark.SparkException: Job aborted.   

Code:

import requests
from pyspark.sql import Row

# Pull the JSON records from the API and build a DataFrame from them
response = requests.get('https://myapiurl.com/v1/data', auth=('user', 'password'))
data = response.json()
df = spark.createDataFrame([Row(**i) for i in data])

# source is the output format (e.g. "csv") and path is the ADLS Gen2 location
df.write.format(source).mode("overwrite").save(path)  # error line
paone
  • Can you please share the code which you are trying to execute? – HimanshuSinha Aug 21 '20 at 02:03
  • Hi @HimanshuSinha-msft, Thank you for the response. Please find OP updated with the code. – paone Aug 21 '20 at 03:08
  • Could you please tell me how you access Azure Data Lake Gen2 in databricks? – Jim Xu Aug 21 '20 at 03:23
  • Hi @JimXu, Thanks for your response. I use the container wasbs://@.blob.core.windows.net/dir – paone Aug 21 '20 at 03:29
  • Have you added `spark.conf.set( "fs.azure.account.key..blob.core.windows.net", "")` to your code? – Jim Xu Aug 21 '20 at 03:31
  • @paone Besides, if you use ADLS Gen2, you need to use the `abfss` protocol to access files and add `spark.conf.set( "fs.azure.account.key..dfs.core.windows.net", "")` to your code to authenticate. For more details, please refer to https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/azure/azure-datalake-gen2. – Jim Xu Aug 21 '20 at 03:40
  • Hi @JimXu. Thanks for your response. I ran the storage account configuration now and it works as expected. Should this be run every time we restart the cluster? – paone Aug 21 '20 at 03:41
  • @paone the `wasbs` protocol is used for Azure blob storage. – Jim Xu Aug 21 '20 at 03:41
  • Yes, you should do that. If you only want to run it one time, you can mount Azure Blob Storage or ADLS Gen2 as the file system in Databricks; then you can access it like a local file system, e.g. `/mnt/..`. – Jim Xu Aug 21 '20 at 03:43
  • @JimXu Would you like to post that as an answer, so that OP can mark this question as answered? – CHEEKATLAPRADEEP Aug 21 '20 at 04:58
  • @paone I summarized my suggestions as a solution. Since it is useful for you, could you please [accept it as an answer](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work)? – Jim Xu Aug 24 '20 at 01:10

1 Answer


I summarize the solution below:

If you want to access Azure Data Lake Gen2 in Azure Databricks, you have two options.

  1. Mount Azure Data Lake Gen2 as the Azure Databricks file system. After doing that, you can read and write files with the path /mnt/<mount-name>/..., and we just need to run the mount code one time.

    a. Create a service principal and assign the Storage Blob Data Contributor role to it, scoped to the Data Lake Storage Gen2 storage account:

     az login
    
     az ad sp create-for-rbac -n "MyApp" --role "Storage Blob Data Contributor" \
         --scopes /subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>
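
    The create-for-rbac command prints the new service principal's appId, client secret (password), and tenant ID; those are the values that fill the <appId>, <clientSecret>, and <tenant> placeholders in the mount configuration below (exact output field names can vary slightly between Azure CLI versions). Optionally, instead of pasting the secret into the notebook, you can keep it in a Databricks secret scope; a minimal sketch, assuming a hypothetical scope name "adls-sp" and key "client-secret":

     # Hypothetical secret scope/key names -- replace with your own.
     # Hard-coding the secret in the configs below also works, it is just less safe.
     client_secret = dbutils.secrets.get(scope="adls-sp", key="client-secret")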
    

    b. Code to mount the container:

     # OAuth (service principal) settings for the ABFS driver
     configs = {"fs.azure.account.auth.type": "OAuth",
       "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
       "fs.azure.account.oauth2.client.id": "<appId>",
       "fs.azure.account.oauth2.client.secret": "<clientSecret>",
       "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant>/oauth2/token",
       "fs.azure.createRemoteFileSystemDuringInitialization": "true"}

     # Mount the container (or a folder in it) under /mnt/flightdata; run once
     dbutils.fs.mount(
        source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/folder1",
        mount_point = "/mnt/flightdata",
        extra_configs = configs)
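
     Once the mount exists, the failing write from the question can target the mount path directly. A minimal sketch (the result_csv folder name is just an example):

      # df is the DataFrame built from the API response in the question.
      # "result_csv" is a hypothetical output folder under the mounted container.
      df.write.format("csv").mode("overwrite").save("/mnt/flightdata/result_csv")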
    
  2. Access directly using the storage account access key.

    We can add the code spark.conf.set("fs.azure.account.key.<storage-account-name>.dfs.core.windows.net", "<storage-account-access-key>") to our script. Then we can read and write files with the path abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/.

    For example:

     from pyspark.sql.types import StringType

     # Authenticate to the storage account with its access key
     spark.conf.set(
       "fs.azure.account.key.testadls05.dfs.core.windows.net", "<account access key>")

     # Small test DataFrame written out as a single CSV file to the "test" container
     df = spark.createDataFrame(["10", "11", "13"], StringType()).toDF("age")
     df.show()
     df.coalesce(1).write.format('csv').option('header', True).mode('overwrite').save('abfss://test@testadls05.dfs.core.windows.net/result_csv')
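
     Applied to the question's code, the same pattern would look roughly like this; the storage account, container, and folder names below are placeholders, not values from the question:

      # Set the access key for the target storage account, then write the
      # API DataFrame from the question as CSV to an abfss:// path.
      spark.conf.set(
        "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
        "<storage-account-access-key>")
      df.write.format("csv").mode("overwrite").save(
        "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/output_csv")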
    


For more details, please refer to the Azure Databricks documentation for Azure Data Lake Storage Gen2: https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/azure/azure-datalake-gen2

Jim Xu