
I have a DLT pipeline that creates a Delta table by reading from SQL Server, and then we call a few APIs to update metadata in our Cosmos DB. Whenever we start it, it gets stuck in the Initializing state. But when we run the same code in a standalone notebook on an interactive cluster, it works fine. Can someone help me understand this issue?

DLT pipeline shouldn't get stuck in the Initializing state


Ravindra
  • Please add more information, including your code: how you read the data, what you're doing outside of the `@dlt.table`, etc. – Alex Ott Nov 11 '22 at 09:15
  • @AlexOtt sorry I missed it. – Ravindra Nov 28 '22 at 06:22
  • @AlexOtt We've defined a function that reads a table from SQL Server using JDBC. Then we make a couple of API calls to our Cosmos DB: the first just reads some configuration from Cosmos for every table, and another writes metadata to Cosmos. Then we create the DLT table in a target database. We call this function concurrently using either the threading module or the concurrent.futures module, as we need to ingest multiple entities into our lakehouse. Standalone (without DLT), this works fast, but in DLT it gets stuck initializing; after 10-15 mins it goes to setting up tables and Running. – Ravindra Nov 28 '22 at 06:37
  • it's hard to say without looking at the code. Most probably your code has some side effects that affect how DLT operates – Alex Ott Nov 28 '22 at 07:52
  • @AlexOtt added it to the question itself. Please note: get_entity_configuration_params and write_to_cosmos are two different functions we defined that make API calls to Cosmos. – Ravindra Nov 28 '22 at 11:46
  • and you don't have the `@dlt.table` annotation on the `read` function? – Alex Ott Nov 28 '22 at 14:44
  • The read function is where the core logic happens; within it I'm using the DLT annotation `@dlt.create_table`, which is the same as `@dlt.table`. This read function is called concurrently. – Ravindra Nov 29 '22 at 06:58

1 Answer


The problem is that you've structured your DLT program incorrectly. Programs written for DLT should be declarative by design, but in your case you're performing your actions at the top level, not inside the functions marked with @dlt.table. When a DLT pipeline starts, it builds the execution graph by evaluating all of the code and identifying the vertices of the execution graph that are marked with @dlt annotations (you can see that your function is called several times, as explained here). And because your code has the side effects of reading all the data with spark.read.jdbc, interacting with Cosmos, etc., the initialization step is really slow.
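
To make the eager-vs-deferred split concrete (a minimal sketch; the table name and body are placeholders I made up), anything at the top level of a DLT Python file runs while the graph is being built, while the body of a function decorated with `@dlt.table` is deferred until the flow actually executes:

import dlt

# Top level: executed while DLT evaluates the file to build the
# execution graph (the "Initializing" phase), possibly more than once.
print("runs during graph construction")

@dlt.table(name="example")  # hypothetical table name
def example():
    # Deferred: runs only when the pipeline materializes this table.
    return spark.range(10)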

To illustrate the problem, let's look at your code structure. Right now you have the following:

def read(...):
  1. Perform read via `spark.read.jdbc` into `df`
  2. Perform operations with Cosmos DB
  3. Return annotated function that will just return captured `df`

As a result, items 1 & 2 are performed during the initialization stage, not when the actual pipeline is executed.
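
In code, that problematic shape looks roughly like this (a sketch reconstructed from your comments; `get_entity_configuration_params` and `write_to_cosmos` are your own helpers, while `jdbc_url` and the connection options are placeholders):

def read(entity):
    # Eager: runs while DLT evaluates the code to build the graph
    df = spark.read.jdbc(url=jdbc_url, table=entity)

    # Eager: Cosmos DB round-trips also happen during initialization
    params = get_entity_configuration_params(entity)
    write_to_cosmos(entity, params)

    # Only this closure, returning the already-captured df, is deferred
    @dlt.table(name=entity)
    def target():
        return df

    return target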

To mitigate this problem you need to change the structure to the following:

def read(...):
  1. Return annotated function that will:
    1. Perform read via `spark.read.jdbc` into `df`
    2. Perform operations with Cosmos DB
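
A sketch of the corrected version (same assumptions as above):

def read(entity):
    # Graph construction now only registers the table; the expensive
    # work runs when the pipeline actually executes this flow.
    @dlt.table(name=entity)
    def target():
        df = spark.read.jdbc(url=jdbc_url, table=entity)
        params = get_entity_configuration_params(entity)
        write_to_cosmos(entity, params)
        return df

    return target

Keep in mind that side effects such as `write_to_cosmos` inside the table function will now run on every execution of the flow (including retries), so they should be idempotent.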
Alex Ott
  • Initially we tried the approach you suggested, but pipeline execution took its own time to complete (same code as in the question, with the structure you suggested). It's not stable: sometimes it takes 24 mins, sometimes 14 mins, and in a few cases it went to 40 mins. But after changing to the current structure, execution is fairly stable across runs, at most 14-15 mins. Please note: in all runs, the same number of tables and the same tables are selected. – Ravindra Nov 30 '22 at 15:49