1

I have a snowflake table (called as original_table) with following fields case_num, code, project_name, sp_id, updated_date. For each case_num the combination of code, project_name, sp_id will be unique.

There is another table (called as additional_table) that has only the case_num, timestamp column. It doesnt have the other key columns like original_table. Its a truncate and loaded table daily.

I should check two conditions to mark the D flag (soft delete):

  • If there is any change in either of code OR project_name OR sp_id of the current_row vs previous_row for a particular case_num then the previous_row should be soft deleted with d flag. (AND)

  • If the case_num did not arrive in additional_table for today's load, then I will need to mark the record as soft delete (D).

The problem I face is that it is hard to mark the deletes accurately since the additional_table doesnt have the other key columns code, project_name, sp_id . So, I manually tried to mark the deleted record using below logic.

WITH src AS (
  SELECT *,
         LAG(code) OVER (PARTITION BY case_num ORDER BY updated_date) AS prev_code,
         LAG(project_name) OVER (PARTITION BY case_num ORDER BY updated_date) AS prev_project_name,
         LAG(sp_id) OVER (PARTITION BY case_num ORDER BY updated_date) AS prev_sp_id,
         LEAD(code) OVER (PARTITION BY case_num ORDER BY updated_date) AS next_code,
         LEAD(project_name) OVER (PARTITION BY case_num ORDER BY updated_date) AS next_project_name,
         LEAD(sp_id) OVER (PARTITION BY case_num ORDER BY updated_date) AS next_sp_id
  FROM original_table
),
act AS (
  SELECT case_num, MAX(updated_date) AS max_updated_date
  FROM additional_table
  GROUP BY case_num
)
SELECT src.*,
       CASE WHEN (
            (
              (prev_code IS NOT NULL AND (code <> prev_code OR code <> next_code))
              OR (prev_project_name IS NOT NULL AND (project_name <> prev_project_name OR project_name <> next_project_name))
              OR (prev_sp_id IS NOT NULL AND (sp_id <> prev_sp_id OR sp_id <> next_sp_id))
            )
            AND updated_date <> act.max_updated_date
            
           ) OR (ACT.case_num IS NULL)
           THEN LAG('D') OVER (PARTITION BY case_num ORDER BY updated_date) ELSE 'N' END AS soft_delete_flag
FROM src
LEFT JOIN act ON src.case_num = act.case_num;

In the below, the 3rd, 6th and 8th rows have changed key values (either one/all) for a particular case_num. So, the 2nd, 5th and 7th row that is previous to the updated valur/current row should be marked as D. Please refer output.

case_num code project_name sp_id updated_date
1234 NULL ABC 123 2023-01-01
1234 NULL ABC 123 2023-01-02
1234 Value ABC 123 2023-01-03
2345 NULL NULL 456 2023-01-01
2345 NULL ABC 456 2023-01-02
2345 NULL DEF 456 2023-01-03
7890 NULL NULL NULL 2023-01-01
7890 New_value BBB 678 2023-01-02

"Expected Output":

case_num code project_name sp_id updated_date soft_delete_flag
1234 NULL ABC 123 2023-01-01 N
1234 NULL ABC 123 2023-01-02 D
1234 Value ABC 123 2023-01-03 N
2345 NULL NULL 456 2023-01-01 N
2345 NULL ABC 456 2023-01-02 D
2345 NULL DEF 456 2023-01-03 N
7890 NULL NULL NULL 2023-01-01 D
7890 New_value BBB 678 2023-01-02 N

"Current Output":

case_num code project_name sp_id updated_date soft_delete_flag
1234 NULL ABC 123 2023-01-01 N
1234 NULL ABC 123 2023-01-02 N
1234 Value ABC 123 2023-01-03 D
2345 NULL NULL 456 2023-01-01 N
2345 NULL NULL 456 2023-01-02 N
2345 NULL DEF 456 2023-01-03 D
7890 NULL NULL NULL 2023-01-01 D
7890 New_value BBB 678 2023-01-02 D

As per above logic, I am getting the D records to the record row that has changed values, but not to the row previous to it.

Please could anyone help to modify this SNOWFLAKE query or any other approach to get the expected output. TIA.

**

  • Updated

**: Input scenario:

case_num code project_name sp_id updated_date
1234 PGK123 ABC NO_VALUE 2023-01-01
1234 PGK456 ABC NO_VALUE 2023-01-01
1234 PGK123 ABC NO_VALUE 2023-01-05
1234 PGK456 ABC NO_VALUE 2023-01-05
1234 PGK123 ABC NO_VALUE 2023-01-08
1234 PGK456 ABC NO_VALUE 2023-01-08

In the above scenario, 2 records are inserted on certain updated_date. So total of 3 pairs on 3 different dates. As per the conditions, there should not be any Deleted record and all of them should be 0. But the output shows like below with a deleted record with one of the record of the latest date

case_num code project_name sp_id updated_date soft_delete_flag
1234 PGK123 ABC NO_VALUE 2023-01-01 0
1234 PGK456 ABC NO_VALUE 2023-01-01 0
1234 PGK123 ABC NO_VALUE 2023-01-05 0
1234 PGK456 ABC NO_VALUE 2023-01-05 0
1234 PGK123 ABC NO_VALUE 2023-01-08 1
1234 PGK456 ABC NO_VALUE 2023-01-08 0
Anand2706
  • 57
  • 6

1 Answers1

1

The problem here is that your NULL values shall be matched as equal when consecutive nulls are found. One thing you can do to approach this issue is to coalesce your null values to temporary "NULL" for strings and "-1" for integers (assuming -1 is a value that the column "sp_id" cannot have).

COALESCE(        code, 'NULL') AS coal_c,
COALESCE(project_name, 'NULL') AS coal_pn,
COALESCE(       sp_id,     -1) AS coal_spid

And compute the values from the next rows accordingly.

LEAD(COALESCE(        code, 'NULL')) 
     OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_c,
LEAD(COALESCE(project_name, 'NULL')) 
     OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_pn,
LEAD(COALESCE(       sp_id,     -1)) 
     OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_spid

Once you've done that, you can check when there's a difference between consecutive values for your three records, with a CASE expression.

CASE WHEN NOT coal_c = next_coal_c 
       OR NOT coal_pn = next_coal_pn 
       OR NOT coal_spid = next_coal_spid
     THEN 1 ELSE 0 END DESC

But we want to set only the latest change to 1 for the "soft_delete_flag" column, hence we use a ROW_NUMBER to order your records according to:

  • the previous generated flag descendently (values with 1 come first)
  • the date descendently
ROW_NUMBER() OVER(
    PARTITION BY case_num 
    ORDER BY CASE WHEN ...
                  THEN 1 ELSE 0 END DESC, 
             updated_date DESC
)

And when row number will be equal to 1, we will set 1, otherwise 0, with a CASE expression.

CASE WHEN ROW_NUMBER() OVER(
          ) = 1
     THEN 'D' 
     ELSE 'N' 
END AS soft_delete_flag

Here's the full code:

WITH cte AS (
    SELECT *, 
           COALESCE(        code, 'NULL') AS coal_c,
           COALESCE(project_name, 'NULL') AS coal_pn,
           COALESCE(       sp_id,     -1) AS coal_spid,
           LEAD(COALESCE(        code, 'NULL')) 
               OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_c,
           LEAD(COALESCE(project_name, 'NULL')) 
               OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_pn,
           LEAD(COALESCE(       sp_id,     -1)) 
               OVER(PARTITION BY case_num ORDER BY updated_date) AS next_coal_spid
    FROM tab
)
SELECT case_num,
       code,
       project_name,
       sp_id,
       updated_date,
       CASE WHEN ROW_NUMBER() OVER(
                     PARTITION BY case_num 
                     ORDER BY CASE WHEN NOT coal_c = next_coal_c 
                                     OR NOT coal_pn = next_coal_pn 
                                     OR NOT coal_spid = next_coal_spid
                                   THEN 1 ELSE 0 END DESC, 
                              updated_date DESC
                 ) = 1
            THEN 'D' 
            ELSE 'N' 
       END AS soft_delete_flag
FROM cte
lemon
  • 14,875
  • 6
  • 18
  • 38
  • Thanks @lemon. I tried that and it worked well for most of the scenarios. There is one another scenario like below, I will add that in the comment – Anand2706 Jun 06 '23 at 13:58
  • Added the sample date in my original post under tag "updated" @lemon. Thank you. – Anand2706 Jun 06 '23 at 14:05
  • As far as I know, there's no way to solve that problem easily. That database design is flawed, because you cannot uniquely identify your partitions with case_num, given that they can assume multiple values of code, project_name or sp_id in the same partition and on the same date. Can't do much about it, sorry. This is going to bring you in a very complex and inefficient query due to that flaw. – lemon Jun 06 '23 at 14:10
  • 1
    yes, @lemon. you are right. Thank you so much for the detailed code and help.. – Anand2706 Jun 06 '23 at 14:15
  • If this solution is working for most of your records, but does not fully solve your problem, consider leaving a +1 (voting up), as crafting it required quite some time and effort from this side. – lemon Jun 06 '23 at 14:33
  • sure, done. thanks @lemon. – Anand2706 Jun 07 '23 at 00:12