I have implemented a merge operation for change data capture (CDC) between two tables, bronze and silver, using Spark on Databricks with Delta Lake. The bronze table holds roughly 500 million to 1 billion records and the silver table holds 200-300 million records. The schemas for both tables are provided below:
Bronze table: store_id, product_id, item_id, clicks, impressions, campaign_date, job_run_date. Partitioned by: store_id, campaign_date, job_run_date
Silver table: store_id, product_id, item_id, clicks, impressions, campaign_date, job_run_date, md5_hash_key. Partitioned by: store_id, campaign_date
The merge operation works as follows:
- Select the highest job run date from the bronze layer for each store.
- Generate an md5 hash for each record.
- Perform the merge into the silver layer using the md5 hash and store_id. The code for the merge operation is as follows (a sketch of the preparation steps above is included after it):
from delta.tables import DeltaTable

# "deltaTable" is the silver Delta table; "b_df" is the latest bronze batch
# with the md5_hash_key column already computed.
deltaTable.alias("s").merge(
    b_df.alias("b"),
    "s.md5_hash_key = b.md5_hash_key AND s.store_id = b.store_id") \
  .whenMatchedUpdateAll() \
  .whenNotMatchedInsertAll() \
  .execute()
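A minimal sketch of the preparation steps described above is shown here; the table name bronze_table and the exact column list fed into the hash are assumptions, not my actual code:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical bronze table name; the real name is not shown above.
bronze = spark.read.table("bronze_table")

# Step 1: keep only the rows from the latest job_run_date per store.
w = Window.partitionBy("store_id")
latest = (bronze
          .withColumn("max_job_run_date", F.max("job_run_date").over(w))
          .where(F.col("job_run_date") == F.col("max_job_run_date"))
          .drop("max_job_run_date"))

# Step 2: derive the md5 hash key; the column list here is only an assumption.
b_df = latest.withColumn(
    "md5_hash_key",
    F.md5(F.concat_ws("||", "store_id", "product_id", "item_id", "campaign_date")))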
I used the md5 hash and store_id in the merge condition as an attempt to improve performance. The overall runtime dropped by about 30%, but that is still not satisfactory; I believe it should take much less time.
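For reference, one variation of the merge condition (only a sketch, assuming each batch covers a bounded set of campaign_date values) would add literal predicates on the silver partition columns so the merge can skip untouched partitions; date_list here is a hypothetical value derived from the batch, not something in my current code:

# Collect the campaign_date values present in this batch (assumes the set is small).
dates = [r[0] for r in b_df.select("campaign_date").distinct().collect()]
date_list = ", ".join(f"'{d}'" for d in dates)

# Literal predicates on the partition columns let Delta prune files
# instead of scanning the whole silver table during the merge.
(deltaTable.alias("s")
  .merge(
      b_df.alias("b"),
      f"s.campaign_date IN ({date_list}) "
      "AND s.store_id = b.store_id "
      "AND s.md5_hash_key = b.md5_hash_key")
  .whenMatchedUpdateAll()
  .whenNotMatchedInsertAll()
  .execute())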