Documentation seems to be very sparse on Spark SQL watermarking syntax. I am able to find plenty on PySpark watermarking, but I am trying to accomplish this in SQL for a Databricks Delta Live Tables (DLT) table. Below is the syntax I am using for the table that transfers data from the Bronze layer to the Silver layer:
CREATE TEMPORARY STREAMING LIVE VIEW table_staging
AS
SELECT
..
FROM
STREAM(bronze.table)
UNION ALL
SELECT
..
FROM
STREAM(bronze.table_historical) h
LEFT JOIN STREAM(bronze.table_historical2) v
ON h.id = v.id
LEFT JOIN STREAM(bronze.table_historical3) i
ON h.id = i.id
LEFT JOIN STREAM(bronze.table_historical4) c
ON h.id = c.id;
CREATE OR REFRESH STREAMING TABLE table
TBLPROPERTIES ("quality" = "silver", "pipelines.autoOptimize.zOrderCols" = "GroupId");
APPLY CHANGES INTO LIVE.table
FROM STREAM(LIVE.table_staging)
KEYS (id)
IGNORE NULL UPDATES
SEQUENCE BY dateCreated
STORED AS SCD TYPE 1;
The historical tables will never be updated again; they exist only to backfill data from a legacy system. I am afraid, however, that if I don't use the STREAM() syntax when joining to them, each table will be scanned in full every time the DLT pipeline is run (a sketch of that non-streaming variant follows the error below). Below is the error I am receiving when trying to run the pipeline in its current state:
Stream-stream LeftOuter join between two streaming DataFrame/Datasets is not supported without a watermark in the join keys, or a watermark on the nullable side and an appropriate range condition; line 12 pos 2;
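For reference, this is roughly how the second branch of the UNION would look without STREAM() on the historical tables, i.e. reading them as plain batch tables. It is what I am trying to avoid, since my assumption is that a stream-static join like this would re-read the full historical tables on every pipeline run:
SELECT
..
FROM
STREAM(bronze.table_historical) h
LEFT JOIN bronze.table_historical2 v
ON h.id = v.id
LEFT JOIN bronze.table_historical3 i
ON h.id = i.id
LEFT JOIN bronze.table_historical4 c
ON h.id = c.id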
I'm unable to find any documentation on how to apply the watermark in SQL. I know I could use PySpark to accomplish this, but I am hoping it is possible with SQL, since that is how the other pipelines have been built.
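The closest thing I have found is the WATERMARK clause in the Databricks SQL reference, so my best guess is something like the sketch below, with a watermark declared on each streamed relation in the join. To be clear, this is only a guess: I am assuming dateCreated is a usable event-time column on each of the historical tables, I am not certain where the alias belongs relative to the WATERMARK clause, and I don't know whether a time-range condition on the join is also required (as the error message hints) or whether this is supported inside a DLT view at all:
SELECT
..
FROM
STREAM(bronze.table_historical) WATERMARK dateCreated DELAY OF INTERVAL 10 MINUTES AS h
LEFT JOIN STREAM(bronze.table_historical2) WATERMARK dateCreated DELAY OF INTERVAL 10 MINUTES AS v
ON h.id = v.id
LEFT JOIN STREAM(bronze.table_historical3) WATERMARK dateCreated DELAY OF INTERVAL 10 MINUTES AS i
ON h.id = i.id
LEFT JOIN STREAM(bronze.table_historical4) WATERMARK dateCreated DELAY OF INTERVAL 10 MINUTES AS c
ON h.id = c.id
Is something along these lines the intended syntax, or does the watermark need to be expressed differently for a DLT streaming view?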