
My ultimate goal is to have a copy (in delta format) of an append-only incremental table that is in JDBC (SQL).

I have a batch process reading from the incremental append-only JDBC (SQL) table with spark.read (since .readStream is not supported for JDBC sources). Every day, the most recent day of data is saved as a delta table. Note that this is not an append-only delta table; rather, it is overwritten every day with the most recent day of data.
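
Roughly, the daily batch step looks like this (the JDBC connection details, the date filter, and the daily_batch table name are simplified placeholders):

jdbc_url = "jdbc:sqlserver://..."   # placeholder connection string
source_table = "dbo.source_table"   # placeholder source table name

latest_day_df = (
    spark.read.format("jdbc")       # batch read, since .readStream is not supported for JDBC
    .option("url", jdbc_url)
    .option("dbtable", source_table)
    .load()
    .filter("event_date = current_date()")   # keep only the most recent day (placeholder column)
)

(
    latest_day_df.write.format("delta")
    .mode("overwrite")              # the intermediate table is overwritten every day
    .saveAsTable("daily_batch")
)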

What I think I need next is spark.readStream out of the delta table, with .option("skipChangeCommits", "true"). The Databricks documentation outlines exactly this.

I have selected the Preview channel in the pipeline settings.

The code for the final streaming table in my DLT pipeline is:

@table(
    name='streaming table',
)
def create_df():
    return spark.readStream.option("skipChangeCommits", "true").table("daily_batch")

However, the error is: NameError: name 'table' is not defined,None,Map(),Map(),List(),List(),Map())

In case it is a typo in the documentation, I have also tried @dlt.table, and the error is:

pyspark.errors.exceptions.AnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table or view "daily_batch" cannot be found. Verify the spelling and correctness of the schema and catalog.

Oliver Angelil

2 Answers


A few things:

  • instead of @table use @dlt.table (unless you did from dlt import table)
  • instead of daily_batch as the table name you need to use a fully-qualified name - three levels (catalog.schema.table) or two (schema.table), depending on whether you use Unity Catalog or not - see the sketch below

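For example, something like this (not tested; my_catalog and my_schema are placeholders for your actual catalog and schema):

import dlt

@dlt.table(name="streaming_table")
def create_df():
    return (
        spark.readStream
        .option("skipChangeCommits", "true")
        .table("my_catalog.my_schema.daily_batch")   # fully-qualified name; adjust catalog/schema
    )
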
But are you really sure that it will help you? The docs say:

skipChangeCommits disregards file changing operations entirely. Data files that are rewritten in the source table due to data changing operation such as UPDATE, MERGE INTO, DELETE, and OVERWRITE are ignored entirely. In order to reflect changes in upstream source tables, you must implement separate logic to propagate these changes.

Right now you may use a slightly different approach (not tested, but it should work) - don't overwrite the data, but truncate & then append, roughly as in the sketch below.
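
For example (again not tested; the path to the intermediate table and the latest_day_df DataFrame holding the newest day of data are placeholders):

from delta.tables import DeltaTable

daily_batch_path = "/placeholder/path/to/daily_batch"

dt = DeltaTable.forPath(spark, daily_batch_path)
dt.delete()   # truncate: a data-changing commit that skipChangeCommits will ignore

(
    latest_day_df.write.format("delta")
    .mode("append")   # the appended files are picked up by the downstream stream
    .save(daily_batch_path)
)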

Alex Ott
  • Thanks - taking your tip to delete then append data from the "intermediate" daily delta table. I can load the table as a delta table with `dt=delta.DeltaTable.forPath` and then delete all rows with `dt.delete()`, but how can I append new rows and then still return a table from the function as is required in DLT (with append mode you must specify a save path) – Oliver Angelil Jul 20 '23 at 15:36
  • the first table in DLT must use `spark.read` and not `spark.readStream` since the source data is JDBC; and the next table must use `spark.readStream` in order to keep appending daily? – Oliver Angelil Jul 20 '23 at 16:01
  • I thought that you have JDBC reading as a separate job... Not sure how it could behave if you put everything into DLT – Alex Ott Jul 20 '23 at 17:47

My solution to this was to simply use a regular job, not DLT. This simplifies things a lot, as there is no need for an intermediate table. It was just:

[JDBC incremental SQL] --Databricks job--> [delta table written in append-only mode].

I saved the watermark as metadata in the delta table with

.option("userMetadata", watermark.strftime("%Y-%m-%dT%H:%M:%S"))

And read it with each subsequent batch run:

import delta.tables
from pyspark.sql import functions as F

dt = delta.tables.DeltaTable.forPath(spark, path)
watermark = dt.history().select(F.col("userMetadata")).first()[0]   # history() lists the newest commit first

I used spark.read.format("sqlserver") to query from the JDBC server.
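
Putting it together, one batch run of the job looks roughly like this (the connection options, the event_ts watermark column, and the target path are placeholders for my actual setup):

from pyspark.sql import functions as F

df = (
    spark.read.format("sqlserver")
    .option("host", host)           # placeholder connection options
    .option("database", database)
    .option("dbtable", dbtable)
    .option("user", user)
    .option("password", password)
    .load()
    .filter(F.col("event_ts") > F.lit(watermark))   # only rows newer than the stored watermark
)

new_watermark = df.agg(F.max("event_ts")).first()[0]

(
    df.write.format("delta")
    .mode("append")
    .option("userMetadata", new_watermark.strftime("%Y-%m-%dT%H:%M:%S"))   # store the new watermark in the commit
    .save(path)
)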

Oliver Angelil