1

I am loading data into Snowflake data vault modeled database. The model work as follows when a field of a row has been updated:

  1. Set the load end date of this row as equal to current_timestamp().
  2. Add a the same row one more time with the new values into the model.

I am using the merge command from Snowflake within a JavaScript procedure to do so:

var observarion_query = "MERGE INTO HUB_OBSERVATION AS OBS "+
"USING (SELECT DATE(T.$"+OBSERVATION_DATE+", 'DD/MM/YYYY') AS OBS_DATE, T.$"+LOCATIONS+", T.$"+SUBMISSION_TIME+" FROM "+FILE_FULL_PATH+"(FILE_FORMAT=>"+FILE_FORMAT_NAME+") T) ST "+
"ON md5(CONCAT(ST.OBS_DATE, CONCAT('CAMP', CONCAT(ST.$"+LOCATION_POSITION+", ST.$"+SUBMISSION_TIME+")))) = OBS.OBSERVATION_DATE_LOCATION_HASH_KEY "+
"WHEN MATCHED THEN UPDATE SET OBS.LOAD_END_DT = CURRENT_TIMESTAMP() "+
"WHEN NOT MATCHED THEN "+
"INSERT (OBSERVATION_DATE_LOCATION_HASH_KEY, LOAD_DT, LOAD_END_DT, RECORD_SRC, OBSERVATION_DATE, LOCATION_NAME) "+
"VALUES (md5(CONCAT(ST.OBS_DATE, CONCAT('CAMP', CONCAT(ST.$"+LOCATION_POSITION+", ST.$"+SUBMISSION_TIME+")))), current_timestamp(), NULL, 'ONA', ST.OBS_DATE, CONCAT('CAMP', ST.$"+LOCATION_POSITION+")) ";

The problem is within WHEN MATCHED THEN, I need to do an insert for the same row with it's new value and but with extra condition saying:

WHEN MATCHED and OBS.REVIEW_STATUS <> ST.REVIEW_STATUS THEN
// INSERT THE ROW

And I do really know that we can't perform an insert query within WHEN MATCHED THEN statement.

How can we find a turnaround to do so?

alim1990
  • 4,656
  • 12
  • 67
  • 130
  • Just to make sure, you know load_end_dt is deprecated in DV 2.0 ? – MLeblanc Jan 22 '21 at 05:02
  • @MLeblanc what is coming in place of her in DV 2.0 ? – alim1990 Jan 22 '21 at 06:49
  • @MLeblanc anyway we are not using it as a business information, we are jut using it to say that this submission is retired, we are not using it in our data queries that is connected to end-users dashboards. – alim1990 Jan 22 '21 at 07:23
  • check on this https://learndatavault.com/mis-understood-end-dates/ – alim1990 Jan 22 '21 at 07:23
  • yep, but the article your link refered to is this one : https://danlinstedt.com/allposts/datavaultcat/end_of_updates/ if you pay attention, it says : "implement record source tracking as defined in my book" The record source tracking satellites is what you need, it keeps track of which BK has been created/updated/deleted, so no need for load_end_dt anymore. – MLeblanc Jan 22 '21 at 17:59

2 Answers2

5

If I understand your question correctly, you want to have a single source row cause potentially 2 actions against the target table:

  • Update the LOAD_END_DT to the current timestamp (if the key already exists in the target table with LOAD_END_DT = NULL, which signifies that it is "current")
  • Insert a new row into the target table with the latest information

You can achieve this by "splitting" each source row inside of the USING clause into 2 rows using a UNION ALL: one row for the UPDATE and one row for the INSERT. I typically include a boolean flag to differentiate them (since they are otherwise duplicates). In the INSERT part, I do a SELECT ... WHERE NOT EXISTS (SELECT 1 FROM target WHERE key = key and MD5() = MD5()) so that the new row is inserted only if it is not already the same as the current row. My ON clause has a filter on the boolean representing the UPDATE scenario.

[Edited to include a sample MERGE]

First, assume the following stage and final table definitions:

CREATE OR REPLACE TRANSIENT TABLE T_STAGE (
  ID            INTEGER
 ,COL1          VARCHAR
 ,COL2          VARCHAR
 ,COL3          VARCHAR
)
;
CREATE OR REPLACE TRANSIENT TABLE T_FINAL (
  ID            INTEGER
 ,START_TS      TIMESTAMP_LTZ
 ,END_TS        TIMESTAMP_LTZ
 ,COL1          VARCHAR
 ,COL2          VARCHAR
 ,COL3          VARCHAR
 ,COL_MD5_HASH  VARCHAR
)
;

This MERGE illustrates the use of UNION ALL to split a single source row into 2, so that an INSERT and an UPDATE can be applied against the target table:

MERGE INTO T_FINAL AS TGT
  USING (
    WITH CTE_X AS (
      SELECT ID
            ,COL1
            ,COL2
            ,COL3
            ,MD5(ARRAY_TO_STRING(ARRAY_CONSTRUCT(ID, COL1, COL2, COL3), '^')) AS COL_MD5_HASH
        FROM T_STAGE
    )
    SELECT FALSE AS UPDATE_FLAG
          ,X.ID
          ,X.COL1
          ,X.COL2
          ,X.COL3
          ,X.COL_MD5_HASH
      FROM CTE_X X
     WHERE NOT EXISTS (
             SELECT 1
               FROM T_FINAL T2
              WHERE T2.COL_MD5_HASH = X.COL_MD5_HASH
           )
    UNION ALL
    SELECT TRUE AS UPDATE_FLAG
          ,X.ID
          ,X.COL1
          ,X.COL2
          ,X.COL3
          ,X.COL_MD5_HASH
      FROM CTE_X X
           JOIN T_FINAL T3
             ON T3.END_TS IS NULL
            AND T3.ID = X.ID
            AND T3.COL_MD5_HASH != X.COL_MD5_HASH
  ) AS SRC
  ON TGT.END_TS IS NULL
 AND SRC.ID = TGT.ID
 AND SRC.UPDATE_FLAG
 WHEN NOT MATCHED THEN INSERT (ID, START_TS, END_TS, COL1, COL2, COL3, COL_MD5_HASH)
   VALUES (SRC.ID, CURRENT_TIMESTAMP(), NULL, SRC.COL1, SRC.COL2, SRC.COL3, SRC.COL_MD5_HASH)
 WHEN MATCHED THEN UPDATE SET END_TS = CURRENT_TIMESTAMP()
;

There are many assumptions and variations depending on your specs. For example, the NOT EXISTS can be removed if the stage table rows are purged after every MERGE... it is only there to avoid INSERTing the same stage row multiple times. You would have to adjust to match your specs. This is only provided for illustrative purposes (since you asked).

Darren Gardner
  • 1,094
  • 4
  • 6
1

The logic (or turnaround) can be:

  1. Setting the load_end_dates for updated rows and inserting completely new rows:

    MERGE ... WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT

  2. Inserting the updated rows with their new validfrom/validtos, something like:

    INSERT INTO target_table JOIN source_table ON target_table.key = source_table.key WHERE target_table.col1<>source_table.col1, ... (so you have to identify the updated rows here)

If you want to insert the updated rows, you have to identify them. This is done with a comparison between your target_table and your source_table. How to identify updated rows:

  1. Join target_table and source_table by using the key of the tables (if you want to join by business key)
  2. Filter those rows where a column has changed --> add a WHERE-clause that checks something like source.colA != target.colA or source.colB != target.colB and so on
  3. Insert the result of this select-statement to your Satellite

Another hint may be using a Changed hash key and just check whether there is a different change hash for the latest record of a business entity, see here: https://www.hansmichiels.com/2016/04/09/hash-diff-calculation-with-sql-server-datavault-series/

Marcel
  • 2,454
  • 1
  • 5
  • 13