I haven't used Delta Lake's Change Data Feed (CDF) yet, and I want to understand whether it could be relevant for us.
We have the following set-up:
- Raw Data (update events from DynamoDB) ends up in a Staging Area ->
- We clean the new data and append it to a Bronze Table ->
- We merge the appended data into a Silver Table, to represent the latest state ->
- We run SQL queries on top of the Silver Tables, joining and aggregating them, thus creating our Gold Tables
Currently we track new data using Streaming Checkpoints. This is very efficient in the Bronze -> Silver stage, since that stage is append-only.
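For context, our Bronze -> Silver job is roughly the following (simplified sketch; table names, the `id` key column, and the checkpoint path are illustrative, and `spark` is assumed to be an existing SparkSession, e.g. in a Databricks notebook):

```python
from delta.tables import DeltaTable

def upsert_to_silver(batch_df, batch_id):
    # MERGE each micro-batch of new Bronze rows into the Silver table,
    # so Silver always reflects the latest state per key
    silver = DeltaTable.forName(spark, "silver.events")
    (silver.alias("s")
        .merge(batch_df.alias("b"), "s.id = b.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream
    .table("bronze.events")  # append-only, so a plain stream read works
    .writeStream
    .foreachBatch(upsert_to_silver)
    .option("checkpointLocation", "/checkpoints/bronze_to_silver")  # tracks processed data
    .trigger(availableNow=True)  # process everything new, then stop
    .start())
```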
To my understanding, CDF could improve the performance of our Silver -> Gold jobs: with Streaming Checkpoints you still have to read an entire rewritten Parquet file even if only one row in it changed, whereas with CDF you only read the recorded table changes. Is that correct?
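If I understand the docs correctly, this would mean enabling the feed on the Silver table and having the Gold job stream only the changes, something like the sketch below (again with illustrative table names; the Gold job would still keep its own checkpoint on the write side):

```python
# One-time: enable the change data feed on the Silver table
spark.sql("""
    ALTER TABLE silver.events
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# The Silver -> Gold job then reads row-level changes instead of whole
# rewritten data files; each change row carries _change_type ('insert',
# 'update_preimage', 'update_postimage', 'delete'), plus _commit_version
# and _commit_timestamp
changes = (spark.readStream
    .option("readChangeFeed", "true")
    .table("silver.events"))

# ...join/aggregate `changes` into the Gold tables, writing with
# writeStream and a dedicated checkpointLocation as usual
```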
Also, is there a reason to use CDF instead of streaming checkpoints in the Bronze -> Silver jobs?