
I am developing a Databricks pipeline and writing my Delta Live Tables (DLT) in Python. I want to understand how to control the order in which the pipeline creates the tables.

Currently, the pipeline attempts to create every table in the order they are written, which results in an error if some data is not available. Let me clarify:

import dlt  # `spark` is provided by the Databricks runtime

@dlt.table
def table1():
  return spark.sql("SELECT a,b,c FROM table_A") 


@dlt.table
def table2():
  return spark.sql("SELECT x,y,z FROM table_B") 


@dlt.table
def table3():
  res1 = dlt.read("table1")
  res2 = dlt.read("table2")
  
  if "a" in res1.schema.names and "x" in res2.schema.names:
    return ...
  elif "a" in res1.schema.names:
    return ...
  elif "x" in res2.schema.names:
    return ...
  else:
    return ...

I want the pipeline to start with table3 and use its if/else conditions to check whether the data from the other sources is available before it creates table1 and/or table2.

Is this possible, or am I misunderstanding something about how pipelines are supposed to work? You can assume that the data will be present at some point, but it may not have been loaded into the database yet.

JJ Kam

1 Answer


When DLT starts the pipeline, it evaluates each of the functions to build the dependency graph, and then executes that graph in the detected dependency order, not in the order the functions are written. This execution doesn't depend on whether the sources actually contain data; it depends only on the input tables existing.

So in your case, table3 depends on table1 and table2, so they will be executed first (but table_A and table_B need to exist), and then table3 will be executed. This differs somewhat between batch and streaming pipelines: in a streaming pipeline, all nodes in the graph can be executed at the same time.

Alex Ott
  • does that mean that the if else conditions in table3 do not have any effect at all? – JJ Kam Aug 04 '22 at 16:30
  • it really depends on what these conditions are doing - if they are changing the schema of the resulting table based on the input schema, then you may get some problems - but to know this you need to provide more information – Alex Ott Aug 04 '22 at 16:56