
I've been following this tutorial which lets me connect to Databricks from Python and then run delta table queries. However, I've stumbled upon a problem. When I run it for the FIRST time, I get the following error:

Container container-name in account storage-account.blob.core.windows.net not found, and we can't create it using anoynomous credentials, and no credentials found for them in the configuration.

When I go back to my Databricks cluster and run this code snippet

from pyspark import SparkContext

spark_context = SparkContext.getOrCreate()

# StorageAccountName and StorageAccountAccessKey are defined elsewhere in the notebook
if StorageAccountName is not None and StorageAccountAccessKey is not None:
  print('Configuring the spark context...')
  # Register the storage account key in the cluster's Hadoop configuration
  spark_context._jsc.hadoopConfiguration().set(
    f"fs.azure.account.key.{StorageAccountName}.blob.core.windows.net",
    StorageAccountAccessKey)

(where StorageAccountName and StorageAccountAccessKey are known) and then run my Python app once again, it runs successfully without throwing the previous error. Is there a way to run this code snippet from my Python app and have it take effect on my Databricks cluster at the same time?
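
For reference, the same property can also be set at the Spark-session level instead of going through the private `_jsc` handle. This is a minimal sketch, assuming a SparkSession is available to the Python app and using placeholder values for the account name and key; a session-level setting applies only to that session, not to the cluster configuration, so whether it is enough depends on how the app talks to the cluster:

from pyspark.sql import SparkSession

# Placeholders -- substitute your own storage account name and access key
StorageAccountName = "<storage-account-name>"
StorageAccountAccessKey = "<storage-account-access-key>"

spark = SparkSession.builder.getOrCreate()

# Session-scoped setting: applies only to this Spark session
spark.conf.set(
  f"fs.azure.account.key.{StorageAccountName}.blob.core.windows.net",
  StorageAccountAccessKey)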

anthino12

1 Answer


You just need to add these configuration options to the cluster itself, as described in the docs. You need to set the following Spark property, the same one you set in your code:

fs.azure.account.key.<storage-account-name>.blob.core.windows.net <storage-account-access-key>

For security, it's better to put the access key into a secret scope and reference it from the Spark configuration (see docs).
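
For example, if the key is stored in a secret scope, the cluster's Spark config entry could reference it roughly like this (the scope and secret names are placeholders, not from the original post):

fs.azure.account.key.<storage-account-name>.blob.core.windows.net {{secrets/<scope-name>/<secret-name>}}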

Alex Ott
  • Should this go through `databricks-connect` or just `pyspark`? – anthino12 Nov 29 '21 at 11:15
  • It should be done on the cluster that you are querying – Alex Ott Nov 29 '21 at 11:16
  • Directly on Databricks? Sorry if the question seems kind of dumb, I'm new to Databricks – anthino12 Nov 29 '21 at 11:20
  • Yes, go to your cluster, click Edit, scroll down to "Advanced options", and put that configuration into the "Spark" section – Alex Ott Nov 29 '21 at 11:26
  • Ah, I understand. Just one more question: if I have a few more clusters, would I need to configure this on all of them? – anthino12 Nov 29 '21 at 11:28
  • Yes. This option provides the way for clusters to authenticate to the storage account. But I would recommend using a SAS key instead of the storage key – Alex Ott Nov 29 '21 at 11:30
  • Is there a way to connect to Databricks from a Python IDE and then run this `fs.azure.account.key...` from there? – anthino12 Nov 29 '21 at 11:49
  • But why do you need this? – Alex Ott Nov 29 '21 at 12:11
  • I'll tell you the use-case I have. I have delta tables in Databricks. My team uses Databricks for manipulating data and creates delta tables from most of it. What I need to build is an API that lets the user query a delta table. If a user wants to preview the first 100 rows of delta table A, my API should connect to Databricks, run the query SELECT * FROM A LIMIT 100, get the result, and give it back to the client app. This app that I work on uses multiple storage accounts on the same cluster, I believe. That's why I need this functionality (see the sketch after this thread). 1/2 – anthino12 Nov 29 '21 at 12:27
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/239663/discussion-between-anthino12-and-alex-ott). – anthino12 Nov 29 '21 at 12:27
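
For the API use-case described in the comments, one possible sketch of running such a query from a Python app, assuming the databricks-sql-connector package is installed and a personal access token is available (the hostname, HTTP path, and token below are placeholders):

from databricks import sql

# Placeholders -- substitute your workspace hostname, the HTTP path of your
# cluster or SQL endpoint, and a personal access token
connection = sql.connect(
  server_hostname="<workspace-hostname>",
  http_path="<http-path>",
  access_token="<personal-access-token>")

cursor = connection.cursor()
cursor.execute("SELECT * FROM A LIMIT 100")
rows = cursor.fetchall()
for row in rows:
  print(row)

cursor.close()
connection.close()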