
The data size is in the terabytes.

I have multiple Databricks notebooks that perform incremental data loads into Google BigQuery, one for each dimension table.

Now I have to perform this data load, i.e. run these notebooks, every two hours.

Which of the following approaches is better:

  1. Create a master Databricks notebook and use dbutils to chain/parallelize the execution of the aforementioned notebooks (see the sketch after this list).

  2. Use Google Cloud Composer (via Apache Airflow's Databricks operators) to create a master DAG that orchestrates these notebooks remotely.
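For reference, a minimal sketch of what option 1 could look like inside a master notebook; the notebook paths, timeout, and worker count below are placeholders, not my actual setup:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder paths for the per-dimension notebooks.
dimension_notebooks = [
    "/Repos/etl/load_dim_customer",
    "/Repos/etl/load_dim_product",
    "/Repos/etl/load_dim_store",
]

def run_notebook(path):
    # dbutils is available implicitly inside a Databricks notebook:
    # dbutils.notebook.run(path, timeout_seconds, arguments)
    return dbutils.notebook.run(path, 7200, {"mode": "incremental"})

# Parallel case: independent dimension loads run concurrently.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_notebook, dimension_notebooks))

# Sequential case: dependent steps run in order afterwards.
dbutils.notebook.run("/Repos/etl/post_load_checks", 7200, {})
```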

I want to know which approach is better given that I have use cases for both parallel and sequential execution of these notebooks.

I'd be extremely grateful if I could get a suggestion or opinion on this topic, thank you.

  • Just an idea without knowing any Databricks: using Composer to orchestrate the Databricks notebooks sounds like... Databricks notebooks with extra steps. Airflow gives you a nice, CLI-free UI, but you have to write code for that and host the environment, which takes time and money. – GregK Sep 29 '21 at 21:26
  • Hey, thanks for the suggestion @GregK. But I have been instructed to use Airflow because we need to track the status of each table, which is not possible with Databricks without delving into the UI manually. Basically, we have two phases for each table: loading incremental data from Databricks into a BigQuery staging table, and merging the BigQuery staging data into a warehouse table. One task is done within Databricks, the other in BQ. We need to represent the status of this with a visual DAG for easy tracking. – Uttkarsh Dutt Oct 06 '21 at 06:58

1 Answer


Why can't you try Databricks Jobs? A job gives you a way of running a notebook either immediately or on a scheduled basis.
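For illustration, such a job with a two-hour schedule can be created through the Databricks Jobs 2.1 REST API; a rough sketch, where the workspace URL, token, cluster id, and notebook path are all placeholders:

```python
import requests

# All of these values are placeholders for your own workspace.
resp = requests.post(
    "https://<workspace-url>/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "name": "dim-load-every-2h",
        "schedule": {
            # Quartz cron: fire at minute 0 of every second hour.
            "quartz_cron_expression": "0 0 0/2 * * ?",
            "timezone_id": "UTC",
        },
        "tasks": [
            {
                "task_key": "load_dim_customer",
                "existing_cluster_id": "<cluster-id>",
                "notebook_task": {"notebook_path": "/Repos/etl/load_dim_customer"},
            }
        ],
    },
)
resp.raise_for_status()
print(resp.json()["job_id"])
```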

  • Hey, thanks for the answer Karthik. That's what we have ended up doing. Our approach is a combination of Airflow using the `DatabricksRunNowOperator` to remotely trigger Databricks jobs. The documentation for this particular operator is scarce and took some digging to find. It also automatically starts the cluster if it's in a terminated state. Refer: https://airflow.apache.org/docs/apache-airflow-providers-databricks/stable/_api/airflow/providers/databricks/operators/databricks/index.html – Uttkarsh Dutt Oct 06 '21 at 09:01
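In outline, that combined approach looks like this; a sketch where the job id, connection ids, and the table/column names in the MERGE are placeholders:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="dim_customer_incremental_load",
    start_date=datetime(2021, 10, 1),
    schedule_interval=timedelta(hours=2),  # the two-hour cadence
    catchup=False,
) as dag:
    # Phase 1: trigger the existing Databricks job that loads the
    # incremental data into a BigQuery staging table.
    load_staging = DatabricksRunNowOperator(
        task_id="load_staging",
        databricks_conn_id="databricks_default",
        job_id=12345,  # placeholder Databricks job id
    )

    # Phase 2: merge the staging table into the warehouse table in BQ.
    merge_warehouse = BigQueryInsertJobOperator(
        task_id="merge_warehouse",
        gcp_conn_id="google_cloud_default",
        configuration={
            "query": {
                "query": """
                    MERGE `dwh.dim_customer` t
                    USING `staging.dim_customer` s
                    ON t.customer_id = s.customer_id
                    WHEN MATCHED THEN UPDATE SET t.name = s.name
                    WHEN NOT MATCHED THEN INSERT ROW
                """,
                "useLegacySql": False,
            }
        },
    )

    load_staging >> merge_warehouse
```

Each phase appears as its own task in the Airflow UI, which gives the per-table status tracking described in the comments above.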