
I am setting up a development environment as a Docker container image. This will allow my colleagues and me to get up and running quickly, using it as an interpreter environment. Our intended workflow is to develop code locally and execute it on an Azure Databricks cluster that is connected to various data sources. For this, I'm looking into using databricks-connect.

I am running into the problem that configuring databricks-connect is apparently an exclusively interactive procedure. As a result, `databricks-connect configure` has to be run, and various configuration values supplied, each time the Docker container is started, which is likely to become a nuisance.

Is there a way to configure databricks-connect non-interactively? This would allow me to include the configuration procedure in the development environment's Dockerfile, so that a developer only has to supply configuration values when (re)building their local development environment.

Wouter Hordijk

2 Answers


Yes, it's possible; there are different ways to do it:

  • use shell multi-line input, like this (taken from here); you just need to define the correct environment variables:
echo "y
$databricks_host
$databricks_token
$cluster_id
$org_id
15001" | databricks-connect configure
  • generate the config file directly; it's just JSON that you need to fill with the necessary parameters. Generate it once, look at ~/.databricks-connect, and reuse it (see the sketch after this list).
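
For the second approach, a minimal sketch of what writing that file from the shell could look like, reusing the same environment variables as above. The JSON keys mirror the prompts of databricks-connect configure (host, token, cluster ID, org ID, port), but they are an assumption here; verify them against a file produced by an interactive run:

# Write the Databricks Connect config file directly, skipping the interactive prompts
cat > ~/.databricks-connect <<EOF
{
  "host": "$databricks_host",
  "token": "$databricks_token",
  "cluster_id": "$cluster_id",
  "org_id": "$org_id",
  "port": "15001"
}
EOF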

But you may not need the configuration step at all: Databricks Connect can take the information either from environment variables (like DATABRICKS_ADDRESS) or from the Spark configuration (like spark.databricks.service.address); refer to the official documentation for details.
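
For instance, the variables could be exported in the image or passed at container start. DATABRICKS_ADDRESS is the one named above; the other names below follow the documented pattern for (legacy) Databricks Connect but should be checked against the docs for your version:

# Connection settings read by Databricks Connect at session creation
export DATABRICKS_ADDRESS="https://<workspace>.azuredatabricks.net"
export DATABRICKS_API_TOKEN="<personal-access-token>"
export DATABRICKS_CLUSTER_ID="<cluster-id>"
export DATABRICKS_ORG_ID="<org-id>"
export DATABRICKS_PORT="15001"

Passing these with docker run -e or --env-file also keeps the token out of the image itself.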

Alex Ott
  • Thanks, the first part of your answer is very helpful and will definitely get the job done! About the second part: i) I am not aware of a configuration file being created after configuration, which presumably could also be created beforehand: do you have any reference for this? ii) Setting environment variables did not work for me, and setting the Spark configuration led me into another interactive session, so I abandoned that route. Am I missing something in the official docs? – Wouter Hordijk Dec 24 '21 at 22:07
  • 1
  • The config file really is created after configuration; its name is `~/.databricks-connect`. Regarding environment variables, it's really strange; I used them some time ago and they worked just fine – Alex Ott Dec 24 '21 at 22:34
  • Thanks, I did not find this config file before! – Wouter Hordijk Dec 24 '21 at 22:37
  • Setting env variables does not work; it fails because `databricks-connect` looks for the `.databricks-connect` file before checking env variables: Caused by: java.lang.RuntimeException: Config file /home/viadot/.databricks-connect not found. Please run `databricks-connect configure` to accept the end user license agreement and configure Databricks Connect. A copy of the EULA is provided below: – Michał Zawadzki Oct 23 '22 at 14:07
  • Hmmm, maybe the behaviour changed since my answer... Can you try to create an empty file, or a file with `{}` as its content? – Alex Ott Oct 23 '22 at 18:12

The answer above didn't work for me; this, however, did:

import json, os
from pyspark.sql import SparkSession

# Write the config file databricks-connect expects, then build the session
with open(os.path.expanduser("~/.databricks-connect"), "w") as f:
    json.dump(db_connect_config, f)
spark = SparkSession.builder.getOrCreate()

Where `db_connect_config` is a dictionary with the credentials.
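
A plausible shape for that dictionary, mirroring the fields databricks-connect configure asks for; the exact key names are an assumption and should be verified against a config file produced by an interactive run:

# Hypothetical example; replace the placeholders with your workspace's values
db_connect_config = {
    "host": "https://<workspace>.azuredatabricks.net",  # workspace URL
    "token": "<personal-access-token>",
    "cluster_id": "<cluster-id>",
    "org_id": "<org-id>",
    "port": "15001",
}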

Michał Zawadzki