I have a requirement to implement a SCD Type 2 in my Delta Tables. The Scenario is as under.
Source Table Columns are:
---------------------------------------------------
state. Code. Name. value. insert_datetime
---------------------------------------------------
TX. AX transport 2.2 2023-01-01
AL BB Mining 1.1 2023-01-03
BY XC Infras. 4.1 2022-06-06
JJ NR Educ 5.6 2023-08-10
Destination Table (Silver) is as under:
--------------------------------------------------------------------------------
state. Code. Name. value. insert_datetime. is_current. ExpiryDate
--------------------------------------------------------------------------------
TX. AX transport 2.2 2023-08-14 Y 9999-12-31
AL BB Mining 1.1 2023-08-14. Y 9999-12-31
BY XC Infras. 4.4 2023-08-14. N 2023-08-14
BY XC Infras. 4.1 2023-08-14. Y 9999-12-31
JJ NR Educ 5.6 2023-08-14 Y 9999-12-31
What i want to achieve is to Insert the new records, delete the missing records and update the existing records marking the previous one as inactive and inserting the new one (as for state BY).
I want to achieve this using either Python or PySpark on a databricks platform and have used a few approaches but being new to SDC Type 2 i couldn't figure out how to achieve this.
All examples i have seen show a surrogate key and a merge key, One of my questions is what is the need of a separate surrogate key and merge_key.
I am thinking of creating a surrogate key based upon my key columns [state, code, name, value] in both dataframes (source and target) but i am not sure how to achieve the results end to end.