AutoLoader
AutoLoader is a tool for automatically and incrementally ingesting new files from Cloud Storage (e.g. S3, ADLS), and can be run in batch or streaming modes.
If your "staging" dataset is just files in cloud storage, and not a Delta Lake table, then AutoLoader is the perfect and best solution for your use case.
For example, if your daily staging data is in S3 and JSON format, you can use AutoLoader to create a batch job that ingests only new data and then shuts down, like so in PySpark:
(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # The schema location directory keeps track of your data schema over time
    .option("cloudFiles.schemaLocation", "s3://<path-to-store-schema>")
    .option("cloudFiles.inferColumnTypes", "true")
    .load("s3://<source-data-with-nested-json>")
    .writeStream
    .option("checkpointLocation", "s3://<path-to-checkpoint>")
    # The availableNow trigger will ingest new data as of now, then stop.
    .trigger(availableNow=True)
    .toTable("<catalog>.<schema>.<table-to-append-to>"))
Note: you can use this in a traditional Databricks Job/Workflow, or in a Delta Live Tables (DLT) pipeline. More on AutoLoader with DLT: https://docs.databricks.com/ingestion/auto-loader/dlt.html
dlt.read_stream()
The dlt.read_stream() method is only relevant if you're using Delta Live Tables (DLT) to build your ETL/ELT pipeline. While AutoLoader is meant for ingesting files from cloud storage, dlt.read_stream() is specifically for streaming reads from an existing table in your lakehouse.
For example, a common implementation of the Medallion Architecture is to create multiple tables in a single DLT pipeline like so (a short sketch follows the list):
- Bronze: use AutoLoader to ingest raw files from cloud storage into a Delta table (the bronze)
- Silver (cleaned): use dlt.read_stream() to read from the Bronze table, do some data cleansing / normalizing, and write to another table (the silver)
- Gold (aggregates): use dlt.read() or dlt.read_stream() to read from the Silver table(s) and create business-level aggregations of the data (the gold)
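Here's a minimal sketch of what that could look like in a DLT notebook (the table names, the columns event_id and event_ts, and the cleansing/aggregation logic are all placeholders for illustration):

import dlt
from pyspark.sql import functions as F

# Bronze: AutoLoader ingests raw JSON files from cloud storage
@dlt.table(comment="Raw events ingested with AutoLoader")
def bronze_events():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("s3://<source-data-with-nested-json>"))

# Silver: streaming read from the Bronze table, with some cleansing
@dlt.table(comment="Cleansed events")
def silver_events():
    return (dlt.read_stream("bronze_events")
            .where(F.col("event_id").isNotNull()))

# Gold: read from the Silver table and build a business-level aggregate
@dlt.table(comment="Daily event counts")
def gold_daily_event_counts():
    return (dlt.read("silver_events")
            .groupBy(F.to_date("event_ts").alias("event_date"))
            .count())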
dlt.create_streaming_live_table()
This method is specifically for use with Change Data Capture (CDC) within the Delta Live Tables framework. CDC lets you automatically capture changes (inserts, updates, deletes) made to a particular Delta Lake table. https://docs.databricks.com/delta-live-tables/python-ref.html#change-data-capture-with-python-in-delta-live-tables
Note: The create_target_table() and create_streaming_live_table() functions are deprecated. Databricks recommends updating existing code to use the create_streaming_table() function.
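For reference, here's a rough sketch of a CDC flow using the newer create_streaming_table() API (the source table customers_cdc_feed and the column/operation names are assumptions for illustration):

import dlt
from pyspark.sql.functions import col, expr

# Create the target streaming table (replaces the deprecated
# create_streaming_live_table() / create_target_table())
dlt.create_streaming_table("customers")

# Apply the change records (inserts, updates, deletes) from a source
# table of CDC events to the target table
dlt.apply_changes(
    target="customers",
    source="customers_cdc_feed",
    keys=["customer_id"],
    sequence_by=col("change_timestamp"),
    apply_as_deletes=expr("operation = 'DELETE'"),
    except_column_list=["operation", "change_timestamp"],
)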
Continued...
One last remark in response to this statement:
would need some kind of watermark to "remember" the point where the data was last read
This is one of the primary purposes of Spark Structured Streaming's checkpointing and you do not need to build any custom/bespoke solution to that problem.
See the AutoLoader example above; specifically, the combination of the .option("checkpointLocation", "s3://<path-to-checkpoint>") and .trigger(availableNow=True) lines configures the streaming query to save its progress (checkpoints) to a cloud storage location and to process only data that is available as of the time the query starts.