I have a data connection source that creates two datasets:
- Dataset X (Snapshot)
- Dataset Y (Incremental)
The two datasets pull from the same source. Dataset X consists of the current state of all rows in the source table. Dataset Y pulls all rows that have been updated since the last build. The two datasets are then merged downstream into dataset Z, where each row in dataset Z comes either from dataset X or from the most recent version of that row in dataset Y. This gives us both low-latency updates and good partitioning.
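
To make the merge concrete, here is a rough sketch of what it does, written in plain Python rather than as the actual transform. It assumes every row has a primary key column `id` and a last-modified column `updated_at`; both column names, and the function name `merge_latest`, are made up for illustration and are not taken from the real pipeline.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def merge_latest(
    snapshot_rows: Iterable[Row],
    incremental_rows: Iterable[Row],
    key: str = "id",          # hypothetical primary key column
    ts: str = "updated_at",   # hypothetical last-modified column
) -> List[Row]:
    """Union dataset X and dataset Y, keeping the newest version of each key."""
    latest: Dict[object, Row] = {}
    for row in list(snapshot_rows) + list(incremental_rows):
        current = latest.get(row[key])
        # Keep this row if the key is new or the row is at least as recent.
        if current is None or row[ts] >= current[ts]:
            latest[row[key]] = row
    # Sort by key only so the output order is deterministic.
    return sorted(latest.values(), key=lambda r: r[key])


# Dataset X: current state of the source table.
dataset_x = [
    {"id": 1, "updated_at": 1, "value": "a"},
    {"id": 2, "updated_at": 1, "value": "b"},
]
# Dataset Y: rows updated since the last build (may hold several versions).
dataset_y = [
    {"id": 2, "updated_at": 2, "value": "b2"},
    {"id": 2, "updated_at": 3, "value": "b3"},
]

dataset_z = merge_latest(dataset_x, dataset_y)
```

In this toy example, row 1 reaches dataset Z from dataset X, and row 2 reaches it as the most recent version found in dataset Y.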
When rows are deleted in the source table, they are no longer present in dataset X but are still present in dataset Y.
What would be the best way to keep these 'deleted' rows in dataset Z? Ideally, I would also be able to snapshot dataset Y without losing any of the 'deleted' rows.