I have implemented a merge operation for change data capture (CDC) between two tables, bronze and silver, using Spark on Databricks with Delta Lake. The bronze table contains 500 million to 1 billion records and the silver table contains 200-300 million. The schema for both tables is provided below:

Bronze table: store_id, product_id, item_id, clicks, impressions, campaign_date, job_run_date. Partitioned by: store_id, campaign_date, job_run_date

Silver table: store_id, product_id, item_id, clicks, impressions, campaign_date, job_run_date, md5_hash_key. Partitioned by: store_id, campaign_date

The merge operation works as follows:

  • Select the highest job_run_date from the bronze layer for each store.
  • Generate an md5 hash for each record (a sketch of these two preparation steps is shown after the merge code below).
  • Perform the merge with the silver layer using the md5 hash and store_id. The code for the merge operation is as follows:
# deltaTable references the silver Delta table; b_df holds the prepared bronze records
deltaTable.alias("s").merge(
    b_df.alias("b"),
    "s.md5_hash_key = b.md5_hash_key AND s.store_id = b.store_id") \
  .whenMatchedUpdateAll() \
  .whenNotMatchedInsertAll() \
  .execute()
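
For reference, b_df (the first two steps above) could be prepared roughly as follows. This is a minimal sketch: the bronze table name, the window-based latest-run filter, and the column list fed into the md5 hash are assumptions, since that code is not shown here.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Read the bronze table (table name is an assumption)
bronze_df = spark.read.table("bronze")

# Step 1: keep only the rows with the highest job_run_date per store
w = Window.partitionBy("store_id")
b_df = (bronze_df
        .withColumn("max_job_run_date", F.max("job_run_date").over(w))
        .filter(F.col("job_run_date") == F.col("max_job_run_date"))
        .drop("max_job_run_date"))

# Step 2: generate an md5 hash per record
# (which columns feed the hash is an assumption)
b_df = b_df.withColumn(
    "md5_hash_key",
    F.md5(F.concat_ws("||", "store_id", "product_id", "item_id", "campaign_date")))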

I used the md5 hash and store_id in the merge condition as an attempt to improve performance. The overall time was reduced by 30%, but that is still not satisfactory, as I believe it should take much less time.
