
In our deployment.yaml file we have basically the same instructions for each environment, but there are some settings I might want to set differently per environment, e.g. schedules.

Can I, for example, define a default profile where I put the steps once, and then just have override values per environment?

  default:
    workflows:
      - name: "Load_Daily"
        schedule:
          quartz_cron_expression: "0 1 * * * ?" #
          timezone_id: "Europe/Helsinki"
          pause_status: "PAUSED"
        job_clusters:
          - job_cluster_key: "default"
            <<: *basic-static-cluster
        max_concurrent_runs: 1
  prod:
    workflows:
      - name: "Load_Daily"
        schedule:
          quartz_cron_expression: "0 */1 * * * ?" #
          pause_status: "UNPAUSED"

1 Answer


Yes, you can :)
You just have to create an anchor for the default workflow and merge it in (like you did with basic-static-cluster).

Here is a working example based on yours:

default:
  workflows:
    - &default_workflow
      name: "Load_Daily"
      schedule:
        quartz_cron_expression: "0 1 * * * ?" #
        timezone_id: "Europe/Helsinki"
        pause_status: "PAUSED"
      job_clusters:
        - job_cluster_key: "default"
      max_concurrent_runs: 1

prod:
  workflows:
    - <<: *default_workflow
      schedule:
        quartz_cron_expression: "0 */1 * * * ?" #
        pause_status: "UNPAUSED"

If you load and print it, you can see that only the schedule part was replaced, as requested (assuming you saved the YAML as yaml_test.yml):

import yaml  # requires PyYAML: pip install pyyaml
from pprint import pprint

# Load the file; safe_load resolves the anchors and merge keys
with open("yaml_test.yml", "r") as f:
    yaml_conf = yaml.safe_load(f)

pprint(yaml_conf)

The output is:

{'default': {'workflows': [{'job_clusters': [{'job_cluster_key': 'default'}],
                            'max_concurrent_runs': 1,
                            'name': 'Load_Daily',
                            'schedule': {'pause_status': 'PAUSED',
                                         'quartz_cron_expression': '0 1 * * * '
                                                                   '?',
                                         'timezone_id': 'Europe/Helsinki'}}]},
 'prod': {'workflows': [{'job_clusters': [{'job_cluster_key': 'default'}],
                         'max_concurrent_runs': 1,
                         'name': 'Load_Daily',
                         'schedule': {'pause_status': 'UNPAUSED',
                                      'quartz_cron_expression': '0 */1 * * * '
                                                                '?'}}]}}
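
Note that YAML merge keys are shallow: the overriding schedule mapping replaces the default one wholesale, which is why timezone_id is missing from the prod output above. If the timezone should carry over, it has to be repeated in the override. A minimal sketch based on the example above:

prod:
  workflows:
    - <<: *default_workflow
      schedule:
        quartz_cron_expression: "0 */1 * * * ?"
        timezone_id: "Europe/Helsinki" # repeated; merge keys do not deep-merge nested mappings
        pause_status: "UNPAUSED"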
  • We solved this with some Jinja scripting in our pipeline. But it seems this will be natively supported through Databricks Asset Bundles (https://www.databricks.com/resources/demos/tours/data-engineering/databricks-asset-bundles) – Mathias Rönnlund Aug 22 '23 at 05:42
  • I'm using Databricks Bundles, and these YAML "tricks" are still very relevant. Databricks Bundles do not yet support the definition of "reusable" blocks like in the example, only single-valued variables. I'm working on a data-pipeline template for my company and use this methodology to define different cluster types for Jobs in different environments. – M.S. Aug 22 '23 at 13:42
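
As an aside, here is a minimal sketch of the Jinja approach mentioned in the first comment, using the Jinja2 library; the template fragment and the rendered values are hypothetical, not taken from the original pipeline:

import yaml
from jinja2 import Template  # requires Jinja2: pip install Jinja2

# A trimmed, hypothetical template fragment; in practice the full
# deployment.yaml would live in the repository as a .j2 file
TEMPLATE = """\
workflows:
  - name: "Load_Daily"
    schedule:
      quartz_cron_expression: "{{ cron }}"
      pause_status: "{{ pause_status }}"
"""

# Render per-environment values into the template, then parse the result
rendered = Template(TEMPLATE).render(cron="0 */1 * * * ?", pause_status="UNPAUSED")
print(yaml.safe_load(rendered))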