1

I want to move data from an S3 bucket to Databricks. On both platforms I have separate environments for DEV, QA, and PROD.

I use a Databricks notebook which I deploy to Databricks using Terraform.

Within the notebook there are some hardcoded variables, pointing at the specific AWS account and bucket.

I want to change those variables dynamically based on which Databricks environment I deploy the notebook to.

It could probably be achieved with Databricks secrets, but I'd rather not use the Databricks CLI. Are there other options?

Does Terraform provide control over specific code cells within a notebook?

wookash
  • 11
  • 4
  • 1
    As "use notebook" you mean deploying it as a job? – Alex Ott Jul 25 '23 at 13:41
  • @AlexOtt So far I'm only deploying the notebook and running it manually to test how things work. Eventually I will probably also create a job and attach the notebook to it to automate things. – wookash Jul 26 '23 at 05:32

2 Answers

0

There are different options to achieve this:

  • Hardcode all constants in the source code and then select what is necessary via widgets, something like this (you can select a value interactively or pass it as a parameter of the notebook task in a job):
dbutils.widgets.dropdown("env", "dev", ["dev", "prod"])
# separate cell
env = dbutils.widgets.get("env")
if env == "dev":
  bucket = "..."
  ...
elif env == "prod":
  bucket = "..."
else:
  raise Exception("Unknown environment")
  • You can inject the necessary variables into a notebook template using Terraform's built-in templatefile function when deploying to a specific environment (see the first sketch after this list)

  • Or, when you're using databricks_job, you can simply pass all parameters in the base_parameters map and then pull these parameters in the notebook via dbutils.widgets.get (see the second sketch after this list).
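
For the templatefile option, a minimal sketch could look like the following. It assumes a hypothetical template notebooks/my_notebook.py.tftpl containing a placeholder such as ${bucket_name}, and a per-environment Terraform variable var.bucket_name:

resource "databricks_notebook" "my_notebook" {
  path     = "/Shared/my_notebook"
  language = "PYTHON"
  # render the template with environment-specific values before uploading the notebook
  content_base64 = base64encode(templatefile(
    "${path.module}/notebooks/my_notebook.py.tftpl",
    { bucket_name = var.bucket_name }
  ))
}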

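For the base_parameters option, a minimal sketch (using the hypothetical names my_job and my_notebook, and a var.environment variable) could be:

resource "databricks_job" "my_job" {
  # (...)
  notebook_task {
    notebook_path = databricks_notebook.my_notebook.path
    # the "env" key overrides the default value of the "env" widget at run time
    base_parameters = {
      "env" = var.environment
    }
  }
}

With this in place, the dbutils.widgets.get("env") call from the first snippet picks up the environment that Terraform passes to the job.
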
Alex Ott
  • 80,552
  • 8
  • 87
  • 132
  • Thanks for the answer. I knew about the widgets, but I didn't want to use them to avoid having to set them up manually with each run. I didn't know about base_parameters and ended up using environment variables for the cluster, but I'll definitely look into them in the future. – wookash Aug 03 '23 at 07:46
0

I ended up using the cluster's environment variables.

resource "databricks_job" "my_job" {
  # (...)
  new_cluster {
    # (...)
    # made available as an OS environment variable on the cluster at run time
    spark_env_vars = {
      "ENVIRONMENT" = var.environment
    }
  }

  notebook_task {
    notebook_path = databricks_notebook.my_notebook.path
  }
}

Then, in the notebook, I hardcoded the constants in a dictionary and select them by the cluster's environment variable:

from os import environ

# "ENVIRONMENT" is set through spark_env_vars in the Terraform job definition
db_env = environ["ENVIRONMENT"]

aws_account_ids = {
    "dev": 123,
    "qa": 456,
    "prod": 789,
}

aws_account_id = aws_account_ids[db_env]

wookash
  • 11
  • 4