The problem is that you've structured your DLT program incorrectly. Programs written for DLT should be declarative by design, but in your case you're performing your actions at the top level, not inside the functions marked with `@dlt.table`. When a DLT pipeline starts, it builds the execution graph by evaluating all of the code and identifying the vertices of the graph that are marked with `@dlt` annotations (you can see that your function is called several times, as explained here). Because your code has the side effects of reading all the data with `spark.read.jdbc`, interacting with Cosmos DB, etc., the initialization step is really slow.
To illustrate the problem, let's look at your code structure. Right now you have the following:
def read(...):
1. Perform a read via `spark.read.jdbc` into `df`
2. Perform operations with Cosmos DB
3. Return an annotated function that will just return the captured `df`
As a result, items 1 & 2 are performed during the initialization stage, not when the actual pipeline is executed.
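In code, the problematic structure looks roughly like this (a minimal sketch; the function and parameter names are hypothetical, and the Cosmos DB calls are elided):

```python
import dlt

def read(table_name, jdbc_url):
    # Both of these run at graph-initialization time, every time the
    # pipeline source is evaluated, not when the table is refreshed:
    df = spark.read.jdbc(url=jdbc_url, table=table_name)
    # ... operations with Cosmos DB would also execute here ...

    @dlt.table(name=table_name)
    def t():
        # The decorated function only returns the already-captured DataFrame
        return df

    return t
```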
To mitigate this problem, you need to change the structure to the following:
def read(...):
1. Return an annotated function that will:
   1. Perform the read via `spark.read.jdbc` into `df`
   2. Perform operations with Cosmos DB
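A minimal sketch of the corrected structure (using the same hypothetical names as above):

```python
import dlt

def read(table_name, jdbc_url):
    @dlt.table(name=table_name)
    def t():
        # The expensive work now runs only when the pipeline actually
        # materializes this table, not during graph initialization
        df = spark.read.jdbc(url=jdbc_url, table=table_name)
        # ... operations with Cosmos DB go here as well ...
        return df

    return t
```

This way, the initialization stage only registers the table definitions in the execution graph, and the JDBC read and Cosmos DB interactions happen lazily when the table is refreshed during the pipeline run.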