0

Documentation seems to be very sparse on Spark SQL watermaking syntax. I am able to find plenty on PySpark watermarking, but I am trying to accomplish this in SQL for a Databricks Delta Live table. Below is the syntax I am using for said table that transfers data from the Bronze layer to the Silver layer:

CREATE TEMPORARY STREAMING LIVE VIEW table_staging
AS
SELECT 
    ..
FROM 
  STREAM(bronze.table)
UNION ALL
SELECT 
    ..
FROM 
  STREAM(bronze.table_historical) h
  LEFT JOIN STREAM(bronze.table_historical2) v
    ON h.id = v.id
  LEFT JOIN STREAM(bronze.table_historical3) i 
    ON h.id = i.id
  LEFT JOIN STREAM(bronze.table_historical4) c 
    ON h.id = c.id

CREATE OR REFRESH STREAMING TABLE table 
TBLPROPERTIES ("quality" = "silver", "pipelines.autoOptimize.zOrderCols" = "GroupId");

APPLY CHANGES INTO LIVE.table
FROM STREAM(LIVE.table_staging)
KEYS (id)
IGNORE NULL UPDATES
SEQUENCE BY dateCreated
STORED AS SCD TYPE 1; 

The historical tables will never be updated again; they exist for the purpose of backfilling data from a legacy system. I am afraid if I don't use the 'stream' syntax when joining to them, however, that the each table will be scanned every time the DLT pipeline is run. Below is the error I am receiving when trying to run the pipeline in its current state:

Stream-stream LeftOuter join between two streaming DataFrame/Datasets is not supported without a watermark in the join keys, or a watermark on the nullable side and an appropriate range condition; line 12 pos 2;

I'm unable to find any documentation on how applying the watermark would work. I know I could use PySpark to accomplish this, but I am hoping it is possible with SQL since that is how the other pipelines have been built.

work89
  • 75
  • 8

0 Answers0