I have implemented a merge operation for change data capture (CDC) between two tables, bronze and silver, using Spark on Databricks with Delta Lake. The bronze table holds roughly 500 million to 1 billion records and the silver table holds 200-300 million records. The schemas for both tables are provided below:
Bronze table: store_id, product_id, item_id, clicks, impressions, campaign_date, job_run_date. Partitioned by: store_id, campaign_date, job_run_date
Silver table: store_id, product_id, item_id, clicks, impressions, campaign_date, job_run_date, md5_hash_key. Partitioned by: store_id, campaign_date
The merge operation works as follows:
- Select the highest job run date from the bronze layer for each store.
- Generate an md5 hash for each record.
- Perform the merge into the silver layer using the md5 hash and store_id. The code for the merge operation is as follows (a sketch of the preparation steps above is included after it):
from delta.tables import DeltaTable

# "deltaTable" is the silver Delta table; "b_df" is the latest bronze batch
# with the md5_hash_key column already computed.
deltaTable.alias("s").merge(
    b_df.alias("b"),
    "s.md5_hash_key = b.md5_hash_key AND s.store_id = b.store_id") \
  .whenMatchedUpdateAll() \
  .whenNotMatchedInsertAll() \
  .execute()
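A minimal sketch of the preparation steps described above is shown here; the table name bronze_table and the exact column list fed into the hash are assumptions, not my actual code:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical bronze table name; the real name is not shown above.
bronze = spark.read.table("bronze_table")

# Step 1: keep only the rows from the latest job_run_date per store.
w = Window.partitionBy("store_id")
latest = (bronze
          .withColumn("max_job_run_date", F.max("job_run_date").over(w))
          .where(F.col("job_run_date") == F.col("max_job_run_date"))
          .drop("max_job_run_date"))

# Step 2: derive the md5 hash key; the column list here is only an assumption.
b_df = latest.withColumn(
    "md5_hash_key",
    F.md5(F.concat_ws("||", "store_id", "product_id", "item_id", "campaign_date")))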
I used the md5 hash and store_id in the merge condition as an attempt to improve performance. The overall runtime dropped by about 30%, but that is still not satisfactory; I believe it should take much less time.
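For reference, one variation of the merge condition (only a sketch, assuming each batch covers a bounded set of campaign_date values) would add literal predicates on the silver partition columns so the merge can skip untouched partitions; date_list here is a hypothetical value derived from the batch, not something in my current code:

# Collect the campaign_date values present in this batch (assumes the set is small).
dates = [r[0] for r in b_df.select("campaign_date").distinct().collect()]
date_list = ", ".join(f"'{d}'" for d in dates)

# Literal predicates on the partition columns let Delta prune files
# instead of scanning the whole silver table during the merge.
(deltaTable.alias("s")
  .merge(
      b_df.alias("b"),
      f"s.campaign_date IN ({date_list}) "
      "AND s.store_id = b.store_id "
      "AND s.md5_hash_key = b.md5_hash_key")
  .whenMatchedUpdateAll()
  .whenNotMatchedInsertAll()
  .execute())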