Key Issue:
CDC (preview) in ADF has no "Vacuum" or "Retention Period" setting to delete outdated parquet versions / trim Delta Logs for Delta Tables
I am using Azure Data Factory's Change Data Capture feature (currently in Preview) to incrementally refresh a Delta Table I initialized via a "Copy Data" activity from Azure SQL Database to Gen2 Data Lake.
The refresh works well. The Delta Table receives incremental updates, keyed on an "ID" column and driven by a "Last Updated" field, whenever a change in the source data (Azure SQL Database) is detected. The feature explicitly provides options to choose the "Keys" and "Updated" fields, which allowed me to sync the databases successfully.
Delta Tables, however, store a lot of data: every version of the table (current and former) as Parquet files, plus log files in a "_delta_log" folder. Unfortunately, I did not see an option to set a "Retention Period" (to prune old log files) or a "Vacuum" setting (to delete unused data files after a certain age). These settings are easy to apply to Delta Tables managed by Databricks, via the table properties
delta.logRetentionDuration = "interval <interval>"
delta.deletedFileRetentionDuration = "interval <interval>"
However, in Azure Data Factory, the answer does not appear to be that straightforward.
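For reference, this is roughly how I understand those properties are applied where a Spark session with Delta Lake is available (e.g., in Databricks); the table path, intervals, and session setup below are just placeholders, not something ADF exposes:

```python
# A minimal sketch, assuming a Spark session with Delta Lake (e.g., Databricks);
# the table path and intervals are placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
table_path = "abfss://container@myaccount.dfs.core.windows.net/my_delta_table"  # placeholder

# Keep transaction-log entries for 7 days, and let VACUUM remove data files
# that have been unreferenced for more than 7 days.
spark.sql(f"""
    ALTER TABLE delta.`{table_path}`
    SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 7 days',
        'delta.deletedFileRetentionDuration' = 'interval 7 days'
    )
""")

# Physically delete data files that are past the retention window.
DeltaTable.forPath(spark, table_path).vacuum()
```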
I do not have Databricks, but I like the Delta format because it lets me incrementally refresh tables with "upsert".
Question: How do I set the RETENTION PERIOD and VACUUM interval using Azure Data Factory?
Workaround (Partial Success):
Use a Data Flow to Incrementally Refresh Using Change Data Sent to a Separate Storage Location
I have had some success with a workaround. Instead of refreshing data directly from Azure SQL Database into the Delta Table stored in my Gen2 Data Lake account, I readjusted the CDC (preview) to send change files to a separate storage location in my Gen2 Data Lake container. I then set up a trigger that scans that storage location for new change-data Parquet files. When new ones drop, a Data Flow activity executes, reads the change-data Parquets, and "upserts" the changes into the existing Delta Table (using an Alter Row transformation; conceptually this is the Delta merge sketched below).
- SUCCESS! Here, in the sink (writing back to the Delta Table) I can set the Vacuum interval to "1" (one day).
- FAILURE: There is no option to set the Retention Period to trim old log files.
For full success I will need both a VACUUM setting and a RETENTION PERIOD setting; my workaround only covers the VACUUM part.
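For anyone picturing what that Data Flow does: the upsert it performs is conceptually the same as a Delta merge keyed on "ID". A minimal sketch, again assuming a Spark/Delta environment (which the ADF Data Flow abstracts away) and placeholder paths and column names:

```python
# A minimal sketch of the Data Flow's upsert expressed as a Delta merge;
# assumes a Spark session with Delta Lake, and the paths/columns are placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Change-data Parquet files written by CDC (preview) to the separate location.
changes = spark.read.parquet(
    "abfss://container@myaccount.dfs.core.windows.net/cdc-changes/"  # placeholder
)

# The existing Delta Table being kept in sync.
target = DeltaTable.forPath(
    spark, "abfss://container@myaccount.dfs.core.windows.net/my_delta_table"  # placeholder
)

# Match on the key column and upsert: update rows that exist, insert rows that don't.
(
    target.alias("t")
    .merge(changes.alias("c"), "t.ID = c.ID")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```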
Simpler the Better: A Solution That Stays Within CDC (Preview) End-to-End?
Ideally, I would like CDC (Preview) to handle the full incremental refresh end-to-end, saving the cost of running a separate pipeline with a Data Flow activity. But, again, I will need the ability to adjust both the VACUUM interval and the RETENTION PERIOD.
Thanks in advance for your help!