
I have the following code, which works fine for a single table. But when I try to use a for loop to process all the tables in my database, I get the error "AnalysisException: Cannot redefine dataset 'source_ds',Map(),Map(),List(),List(),Map())".

I need to pass the table name to source_ds so that I can process CDC based on the key & sequence columns. I'd appreciate any help or suggestions.

import dlt
import json
from pyspark.sql.functions import *
from pyspark.sql.types import *
import time

landing_db_name = "landing_db"
raw_db_name = "raw_db"

def generate_silver_tables(target_table, source_table, keys_col_list):

    @dlt.table
    def source_ds():
        return spark.table(f"{raw_db_name}.{source_table}")

    ### Create the target table definition
    dlt.create_target_table(
        name=target_table,
        comment=f"Clean, merged {target_table}",
        # partition_cols=["topic"],
        table_properties={
            "quality": "silver",
            "pipelines.autoOptimize.managed": "true"
        }
    )

    ## Do the merge
    dlt.apply_changes(
        target=target_table,
        source="source_ds",
        keys=keys_col_list,
        apply_as_deletes=expr("operation = 'DELETE'"),
        sequence_by=col("ts_ms"),
        ignore_null_updates=False,
        except_column_list=["operation", "timestamp_ms"],
        stored_as_scd_type="1"
    )
    return

# THIS WORKS FINE
#---------------
# raw_dbname = "raw_db"
# raw_tbl_name = 'raw_table'
# processed_tbl_name = raw_tbl_name.replace("raw", "processed")
# generate_silver_tables(processed_tbl_name, raw_tbl_name)


table_list = spark.sql(f"show tables in {landing_db_name}").collect()
for row in table_list:
    landing_tbl_name = row.tableName
    s2 = spark.sql(f"select key from {landing_db_name}.{landing_tbl_name} limit 1")
    keys_col_list = list(json.loads(s2.collect()[0][0]).keys())
    raw_tbl_name = landing_tbl_name.replace("landing", "raw")
    processed_tbl_name = landing_tbl_name.replace("landing", "processed")
    generate_silver_tables(processed_tbl_name, raw_tbl_name, keys_col_list)
#     time.sleep(10)

1 Answer


You need to give a unique name to each table by providing the name attribute in the @dlt.table annotation for the source table, and then use that same name in apply_changes. Otherwise the dataset name is taken from the function name and fails because you have already defined that function. Like this:

def generate_silver_tables(target_table, source_table, keys_col_list):

    @dlt.table(
        name=source_table
    )
    def source_ds():
        return spark.table(f"{raw_db_name}.{source_table}")

    ### Create the target table definition
    dlt.create_target_table(
        name=target_table,
        comment=f"Clean, merged {target_table}",
        # partition_cols=["topic"],
        table_properties={
            "quality": "silver",
            "pipelines.autoOptimize.managed": "true"
        }
    )

    ## Do the merge
    dlt.apply_changes(
        target=target_table,
        source=source_table,
        keys=keys_col_list,
        apply_as_deletes=expr("operation = 'DELETE'"),
        sequence_by=col("ts_ms"),
        ignore_null_updates=False,
        except_column_list=["operation", "timestamp_ms"],
        stored_as_scd_type="1"
    )
    return

See the DLT Cookbook for a full example.

Alex Ott
  • Thank you Alex, that solved the error and I was able to run the job successfully. But when I try to run it a second time, I get "org.apache.spark.sql.streaming.StreamingQueryException: Query terminated with exception: Detected a data update () in the source table at version 3. This is currently not supported. If you'd like to ignore updates, set the option 'ignoreChanges' to 'true'. If you would like the data update to be reflected, please restart this query with a fresh checkpoint directory." Can you please suggest? I can open a new question. – Yuva Feb 15 '23 at 13:38
  • It looks like you're trying to consume from one of the target tables - this is not supported yet. Or you may be getting changes in one of the source tables - in that case you can either add the `ignoreChanges` option as described in the error message, or use the Change Data Feed functionality (see the sketch after these comments). But first you need to find which table causes that. – Alex Ott Feb 15 '23 at 13:50
  • Hi @Alex Ott, thanks for your reply, but what I tried was simply re-running the pipeline, just to see if it's working fine. So what config is required to fetch only the updated records if I rerun it? Do I need to make any configuration change while creating my source tables? Please advise; sorry, I am new to DLT, streaming data and the CDC process. – Yuva Feb 16 '23 at 06:19
  • And if I add the ignoreChanges option with TRUE, it will reprocess the entire dataset, correct? But I need to process only the delta/updated records. Thank you very much. – Yuva Feb 16 '23 at 06:21
  • These are orthogonal things. `ignoreChanges` just says: return only appends. If you need to understand the delta piece, like updates & deletes, then Change Data Feed will help you, but only for the first level of source tables. – Alex Ott Feb 16 '23 at 07:08
  • Also - only updates vs. full reprocessing depends on how you define the tables. If you use `.readStream` then it will handle only updates; in your case you use the `.table` function, which causes a full read of the data (see the streaming-source sketch below). – Alex Ott Feb 16 '23 at 07:20
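
To illustrate the two options mentioned in the comments above, here is a minimal, non-DLT sketch of a Delta streaming read. The table name `raw_db.raw_table` is just a placeholder taken from the question's single-table example, and Change Data Feed is assumed to already be enabled on that table:

# Option 1: don't fail the stream when the source table has updates/deletes
# (the ignoreChanges Delta streaming option); note that rows from rewritten
# files may be delivered again downstream.
appends_only = (
    spark.readStream
         .option("ignoreChanges", "true")
         .table("raw_db.raw_table")  # placeholder table name
)

# Option 2: read the Change Data Feed to see inserts, updates and deletes
# explicitly; requires the table property delta.enableChangeDataFeed = true.
change_feed = (
    spark.readStream
         .option("readChangeFeed", "true")
         .table("raw_db.raw_table")  # placeholder table name
)
# The change feed adds columns such as _change_type and _commit_version
# that describe each row-level change.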
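
And here is a sketch of how the source dataset from the answer could be defined as a streaming read rather than a full read, so each pipeline update only picks up newly appended rows; whether this fits depends on how the raw tables are written (append-only vs. in-place updates):

@dlt.table(
    name=source_table
)
def source_ds():
    # Streaming read: each pipeline update processes only new data,
    # instead of re-reading the whole table as spark.table() does.
    return spark.readStream.table(f"{raw_db_name}.{source_table}")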