
I'm trying to simplify notebook creation for developers/data scientists in my Azure Databricks workspace that connects to an Azure Data Lake Gen2 account. Right now, every notebook has this at the top:

    %scala
    spark.sparkContext.hadoopConfiguration.set("fs.azure.account.auth.type.<datalake.dfs.core.windows.net", "OAuth")
    spark.sparkContext.hadoopConfiguration.set("fs.azure.account.oauth.provider.type.<datalake>.dfs.core.windows.net",  "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.sparkContext.hadoopConfiguration.set("fs.azure.account.oauth2.client.id.<datalake>.dfs.core.windows.net", <SP client id>)
    spark.sparkContext.hadoopConfiguration.set("fs.azure.account.oauth2.client.secret.<datalake>.dfs.core.windows.net", dbutils.secrets.get(<SP client secret>))
    spark.sparkContext.hadoopConfiguration.set("fs.azure.account.oauth2.client.endpoint.<datalake>.dfs.core.windows.net", "https://login.microsoftonline.com/<tenant>/oauth2/token")

Our implementation is trying to avoid mounting in DBFS, so I've been trying to see whether I can use the Spark config on a cluster to define these values instead (each cluster can access a different data lake).

However, I haven't been able to get that to work yet. When I try various flavors of this:

    org.apache.hadoop.fs.azure.account.oauth2.client.id.<datalake>.dfs.core.windows.net <sp client id>
    org.apache.hadoop.fs.azure.account.auth.type.<datalake>.dfs.core.windows.net OAuth
    org.apache.hadoop.fs.azure.account.oauth.provider.type.<datalake>.dfs.core.windows.net "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"
    org.apache.hadoop.fs.azure.account.oauth2.client.secret.<datalake>.dfs.core.windows.net {{secrets/secret/secret}}
    org.apache.hadoop.fs.azure.account.oauth2.client.endpoint.<datalake>.dfs.core.windows.net "https://login.microsoftonline.com/<tenant>"

I get "Failure to initialize configuration" The version above looks like it's defaulting to try to use the storage account access key instead of the SP credentials (this is with just testing with a simple ls command to make sure it works).

    ExecutionError: An error occurred while calling z:com.databricks.backend.daemon.dbutils.FSUtils.ls.
    : Failure to initialize configuration
        at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:51)
        at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:412)
        at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1016)
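
For what it's worth, one way I can check whether the cluster-level settings even reached the driver is to read them back from a notebook. A sketch of that check, assuming a storage account literally named "datalake" (swap in your own account name):

    %scala
    // "datalake" is a placeholder account name.
    val key = "fs.azure.account.auth.type.datalake.dfs.core.windows.net"
    // Value as set in the cluster's Spark config (note the spark.hadoop. prefix):
    println(spark.sparkContext.getConf.getOption("spark.hadoop." + key))
    // Value the Hadoop configuration actually resolved (prefix stripped):
    println(spark.sparkContext.hadoopConfiguration.get(key))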

I'm hoping there's a way around this, although if the only answer is "you can't do it this way," that's an acceptable answer, of course.

Jason Whitish
  • Could you please tell me how you configured it? Did you refer to https://learn.microsoft.com/en-us/azure/databricks/clusters/configure#--spark-configuration? – Jim Xu Dec 09 '20 at 05:18
  • I did use the Spark tab on the cluster, yes, and the second code example above is a copy of what I put in that tab. I also referenced this forum post (on how to include the secret) https://forums.databricks.com/questions/17733/injecting-secrets-in-clusters-spark-config.html – Jason Whitish Dec 09 '20 at 14:42

2 Answers


If you want to connect to Azure Data Lake Gen2, add the authentication information to the cluster's Spark configuration as follows:

    spark.hadoop.fs.azure.account.oauth2.client.id.<datalake>.dfs.core.windows.net <sp client id>
    spark.hadoop.fs.azure.account.auth.type.<datalake>.dfs.core.windows.net OAuth
    spark.hadoop.fs.azure.account.oauth.provider.type.<datalake>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
    spark.hadoop.fs.azure.account.oauth2.client.secret.<datalake>.dfs.core.windows.net {{secrets/secret/secret}}
    spark.hadoop.fs.azure.account.oauth2.client.endpoint.<datalake>.dfs.core.windows.net https://login.microsoftonline.com/<tenant>/oauth2/token
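
With those cluster-level settings in place, a notebook should be able to reach the lake with no per-notebook setup. A minimal smoke test, where "container" and "datalake" are placeholders for your container and storage account names:

    %scala
    // No per-notebook auth code needed once the cluster Spark config is set.
    display(dbutils.fs.ls("abfss://container@datalake.dfs.core.windows.net/"))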

For more details, please refer to the Azure Databricks documentation on cluster Spark configuration and on referencing secrets in Spark configuration properties.

Jim Xu
  • This worked. I had to change the secret to also use `spark.hadoop` instead of `org.apache` btw. Thank you very much. – Jason Whitish Dec 10 '20 at 16:55
  • @Jim-Xu In `{{secrets/secret/secret}}` I assume the first `secrets` is a literal string, correct? Assuming yes, what sections of Azure do I find values for the second and third `secret`? For example, if I created a brand new `KeyVault` named `foo` and a secret within it named `mySecret`, where do I get the values for the second and third `secret`? Thank you. – NYCeyes Jan 15 '21 at 21:39

You might need to configure your cluster to access ADLS Gen2 directly.

Please note the required format for referencing secrets (see "Read a secret" in the Databricks secrets documentation):

The syntax of the Spark configuration property or environment variable path value must be {{secrets/<scope-name>/<secret-name>}}. The value must start with {{secrets/ and end with }}.

So this line:

    spark.hadoop.fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net <service-credential>

should look like this:

    spark.hadoop.fs.azure.account.oauth2.client.secret.yourstorageaccountname.dfs.core.windows.net {{secrets/yoursecretscope/yoursecretname}}
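
To double-check that the scope and secret names used in the {{secrets/...}} reference actually resolve, you can read the secret from a notebook. A sketch using the same placeholder names as above:

    %scala
    // Placeholders match the {{secrets/yoursecretscope/yoursecretname}} reference above.
    val secret = dbutils.secrets.get(scope = "yoursecretscope", key = "yoursecretname")
    println(secret.length)  // proves the lookup worked; the value itself is redacted in output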

sergdess