AutoLoader
AutoLoader is a tool for automatically and incrementally ingesting new files from Cloud Storage (e.g. S3, ADLS), and can be run in batch or streaming modes.
If your "staging" dataset is just files in cloud storage, and not a Delta Lake table, then AutoLoader is the perfect and best solution for your use case.
For example, if your daily staging data is in S3 and JSON format, you can use AutoLoader to create a batch job that ingests only new data and then shuts down, like so in PySpark:
(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # The schema location directory keeps track of your data schema over time
    .option("cloudFiles.schemaLocation", "s3://<path-to-store-schema>")
    .option("cloudFiles.inferColumnTypes", "true")
    .load("s3://<source-data-with-nested-json>")
    .writeStream
    .option("checkpointLocation", "s3://<path-to-checkpoint>")
    # The availableNow trigger will ingest new data as of now, then stop.
    .trigger(availableNow=True)
    .toTable("<catalog>.<schema>.<table-to-append-to>"))
Note: you can use this in a traditional Databricks Job/Workflow, or in a Delta Live Tables (DLT) pipeline. More on AutoLoader with DLT: https://docs.databricks.com/ingestion/auto-loader/dlt.html
dlt.read_stream()
The dlt.read_stream() method is only relevant if you're using Delta Live Tables (DLT) to build your ETL/ELT pipeline. While AutoLoader is meant for ingesting files from cloud storage, dlt.read_stream() is specifically for streaming reads from an existing table in your lakehouse.
For example, a common implementation of the Medallion Architecture is to create multiple tables in a single DLT pipeline like so (a short sketch follows the list):
- Bronze: use AutoLoader to ingest raw files from cloud storage into a Delta table (the bronze)
- Silver (cleaned): use dlt.read_stream() to read from the Bronze table, do some data cleansing / normalizing, and write to another table (the silver)
- Gold (aggregates): use dlt.read() or dlt.read_stream() to read from the Silver table(s) and create business-level aggregations of the data (the gold)
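Here's a minimal sketch of what that could look like in a DLT notebook (the table names, the columns event_id and event_ts, and the cleansing/aggregation logic are all placeholders for illustration):

import dlt
from pyspark.sql import functions as F

# Bronze: AutoLoader ingests raw JSON files from cloud storage
@dlt.table(comment="Raw events ingested with AutoLoader")
def bronze_events():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("s3://<source-data-with-nested-json>"))

# Silver: streaming read from the Bronze table, with some cleansing
@dlt.table(comment="Cleansed events")
def silver_events():
    return (dlt.read_stream("bronze_events")
            .where(F.col("event_id").isNotNull()))

# Gold: read from the Silver table and build a business-level aggregate
@dlt.table(comment="Daily event counts")
def gold_daily_event_counts():
    return (dlt.read("silver_events")
            .groupBy(F.to_date("event_ts").alias("event_date"))
            .count())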
dlt.create_streaming_live_table()
This method is specifically for use with Change Data Capture (CDC) within the Delta Live Tables framework. CDC lets you automatically capture changes (inserts, updates, deletes) made to a particular Delta Lake table. https://docs.databricks.com/delta-live-tables/python-ref.html#change-data-capture-with-python-in-delta-live-tables
Note: The create_target_table() and create_streaming_live_table() functions are deprecated. Databricks recommends updating existing code to use the create_streaming_table() function.
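For reference, here's a rough sketch of a CDC flow using the newer create_streaming_table() API (the source table customers_cdc_feed and the column/operation names are assumptions for illustration):

import dlt
from pyspark.sql.functions import col, expr

# Create the target streaming table (replaces the deprecated
# create_streaming_live_table() / create_target_table())
dlt.create_streaming_table("customers")

# Apply the change records (inserts, updates, deletes) from a source
# table of CDC events to the target table
dlt.apply_changes(
    target="customers",
    source="customers_cdc_feed",
    keys=["customer_id"],
    sequence_by=col("change_timestamp"),
    apply_as_deletes=expr("operation = 'DELETE'"),
    except_column_list=["operation", "change_timestamp"],
)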
Continued...
One last remark in response to this statement:
would need some kind of watermark to "remember" the point where the data was last read
This is one of the primary purposes of Spark Structured Streaming's checkpointing and you do not need to build any custom/bespoke solution to that problem.
See the AutoLoader example above; specifically, the combination of the .option("checkpointLocation", "s3://<path-to-checkpoint>") and .trigger(availableNow=True) lines configures the streaming query to save its progress (checkpoints) to a cloud storage location and to process only data that is available as of the time the query starts.