In my DLT pipeline, I have three layers - bronze, silver, and gold. The bronze layer reads JSON files from an S3 bucket, while the silver layer performs data processing tasks such as adding new columns. The gold layer is responsible for performing aggregations on the processed data.
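
Roughly, the pipeline looks something like this (table names, paths, and columns are simplified placeholders):

import dlt
from pyspark.sql import functions as F

@dlt.table
def bronze_events():
    # Bronze: ingest raw JSON files from S3 with Auto Loader
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://my-bucket/raw/events/")
    )

@dlt.table
def silver_events():
    # Silver: add derived columns
    return (
        dlt.read_stream("bronze_events")
        .withColumn("event_date", F.to_date("event_timestamp"))
    )

@dlt.table
def gold_daily_counts():
    # Gold: aggregate - DLT maintains this as a materialized view
    return (
        dlt.read("silver_events")
        .groupBy("event_date")
        .agg(F.count("*").alias("event_count"))
    )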

I want to write the data from the gold layer of my DLT pipeline to a Kafka topic. However, since DLT doesn't support writeStream operations, I'm performing a readStream operation on the gold table and then trying to write the data to Kafka in a separate notebook. Since the gold table is a materialized view that is constantly being updated, my readStream code fails when I try to extract data from it. If I try to use the `ignoreChanges` option to prevent this issue, the data ends up being duplicated.
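
The separate notebook does roughly the following (topic name, broker address, and checkpoint path are placeholders):

gold_stream = (
    spark.readStream
    .format("delta")
    # .option("ignoreChanges", "true")  # avoids the failure, but re-emits rewritten rows
    .table("gold_daily_counts")
)

(
    gold_stream
    # Kafka sink expects a `value` column, so serialize each row to JSON
    .selectExpr("to_json(struct(*)) AS value")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "gold-topic")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/gold_to_kafka/")
    .start()
)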

What would be the most effective way to handle this?

LucasVaz97
  • It would be helpful to know if the gold table is a streaming table or is fully recomputed. In fact, it would be best to include sample code with these 3 tables defined. – partlov Mar 30 '23 at 19:57
  • Can you elaborate on _"my readStream code fails when I try to extract data from it"_? A materialized view is a regular Delta table, so it should be perfectly fine to `readStream` from it. Have you considered a Databricks Job with two steps - the DLT pipeline followed by the `readStream` code? – Jacek Laskowski Mar 31 '23 at 07:38
  • @JacekLaskowski My pipeline is running in continuous mode, so my gold table is always being updated. If I try to readStream from it, I get this error: `com.databricks.sql.transaction.tahoe.DeltaUnsupportedOperationException: Detected a data update (for example part-00000-92536d0c-4596-49c2-a54c-9e759aa37836-c000.snappy.parquet) in the source table at version 30. This is currently not supported.` I am already doing it in two steps, the same way you suggested. – LucasVaz97 Mar 31 '23 at 17:12
  • Thanks @LucasVaz97. Can I also ask how _"my gold table is always being updated"_? Are you using `overwrite` as @partlov suggests in the answer? – Jacek Laskowski Apr 01 '23 at 12:48
  • @JacekLaskowski, since DLT tables are materialized views, they use overwrite automatically. You can confirm this by using the `ignoreChanges` option to stream the table and observing the constant updates in its version. – LucasVaz97 Apr 17 '23 at 14:19

1 Answer

So, if you are changing the data of a table (in this case through overwrite), you can't read it as a stream. There is another solution which may work: Change Data Feed (CDF). Basically, you will be able to consume CDC-like events from the gold Delta table, much like you would get from CDC tools such as Debezium. The following steps should work:

  1. Enable CDF on the gold table by setting the table property `delta.enableChangeDataFeed` to `true` (see the sketch after this list)
  2. You can consume the CDF as a stream with the following PySpark code:
(
  spark
  .readStream
  .format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", 0)
  .table("gold_table_name")
)
  3. Besides the columns from the original table, this stream will additionally have the columns `_change_type`, `_commit_version`, and `_commit_timestamp`; you may want to filter or transform this stream before writing
  4. Write the stream to Kafka in the format you want (a sketch also follows below)
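
For step 1, a minimal sketch of enabling CDF in the DLT table definition (table and column names are assumptions); on a table not managed by DLT, the equivalent would be `ALTER TABLE gold_table_name SET TBLPROPERTIES (delta.enableChangeDataFeed = true)`:

import dlt

@dlt.table(
    table_properties={"delta.enableChangeDataFeed": "true"}
)
def gold_table_name():
    # placeholder aggregation over a hypothetical silver table
    return dlt.read("silver_table_name").groupBy("some_key").count()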
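
For step 4, a sketch of writing the change feed to Kafka (broker, topic, and checkpoint path are assumptions); the Kafka sink expects a `value` column, so each row is serialized to JSON first:

from pyspark.sql import functions as F

changes = (
    spark
    .readStream
    .format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("gold_table_name")
)

(
    changes
    # keep only new row versions; drop deletes and pre-update images
    .filter(F.col("_change_type").isin("insert", "update_postimage"))
    .selectExpr("to_json(struct(*)) AS value")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "gold-topic")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/gold_cdf_to_kafka/")
    .start()
)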

More documentation can be found in the Databricks documentation on Delta Lake Change Data Feed.

partlov