
I have some raw data sitting in BigQuery tables that are truncate-and-load. My daily ETL feed, which runs on these raw BQ tables, produces a daily snapshot of agents; the daily extracts are shown below.

To give more background, I am trying to implement an insert-only table to realize this concept of virtualized SCD Type 2 logic. This article focuses on implementing SCD Type 2 with delta extracts directly.

My requirement is to design the logic/code to figure out these "insert only" daily extracts so that I can build that virtualized SCD 2 table. I am thinking of putting every daily extract in its own daily partition of a BigQuery table, so that I have all the daily changes in one final table to build the view on.

What is an efficient logic/code/design to find the delta extracts every day and save them in a table of inserts (maybe partitioned; see the final table) in BigQuery?
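As a concrete sketch of that final table (dataset and table names below are placeholders, not part of the question), a day-partitioned, append-only BigQuery table could be declared as:

```sql
-- Placeholder names: my_dataset.agent_deltas.
-- Each day's delta extract is appended into its own daily partition.
CREATE TABLE IF NOT EXISTS my_dataset.agent_deltas (
  Agent_Key  INT64,
  Agent_name STRING,
  MD5_CD     STRING,
  row_eff_ts TIMESTAMP
)
PARTITION BY DATE(row_eff_ts);
```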

Daily extract on 2022-03-01

Agent_Key  Agent_name  MD5_CD  row_eff_ts
12345      Josh        abcde   2022-03-01 04:14:06

Delta Extract on 2022-03-01 should look like

Agent_Key  Agent_name  MD5_CD  row_eff_ts
12345      Josh        abcde   2022-03-01 04:14:06

Daily extract on 2022-03-02 (New record for 12346)

Agent_Key  Agent_name  MD5_CD  row_eff_ts
12345      Josh        abcde   2022-03-02 04:14:06
12346      Mark        fghij   2022-03-02 04:14:06

Delta Extract on 2022-03-02 should look like (should only capture changes)

Agent_Key  Agent_name  MD5_CD  row_eff_ts
12346      Mark        fghij   2022-03-02 04:14:06

Daily extract on 2022-03-03 (updated record for 12345)

Agent_Key  Agent_name  MD5_CD  row_eff_ts
12345      Josher      mnopi   2022-03-03 04:14:06
12346      Mark        fghij   2022-03-02 04:14:06

Delta Extract on 2022-03-03 should look like (should only capture changes)

Agent_Key  Agent_name  MD5_CD  row_eff_ts
12345      Josher      mnopi   2022-03-03 04:14:06

I have to build this final table of inserts (new + changed records) by appending each day's delta extract, so that I can build a view to calculate my row_end_eff_ts.

Final table of Inserts should look like

Agent_Key  Agent_name  MD5_CD  row_eff_ts
12345      Josh        abcde   2022-03-01 04:14:06
12346      Mark        fghij   2022-03-02 04:14:06
12345      Josher      mnopi   2022-03-03 04:14:06
cheapcoder
  • @MikhailBerlyant - Sorry to bother you, any ideas on how to implement this? This is what I have in mind, please correct me if wrong: 1) – cheapcoder Mar 01 '22 at 22:57
  • @MikhailBerlyant - 1) Every snapshot will be inserted as a full extract into its own partition inside a BQ table A (2022-03-01 partition, 2022-03-02 partition, and so on). 2) I will compare the daily extract partitions (2022-03-01, 2022-03-02) by left joining the 2022-03-02 partition of table A with the 2022-03-01 partition of table A based on md5, and insert only the delta records into table B, which then holds only delta snapshots (of only inserts and updates). 3) This table B serves as an append-only partitioned table, with each partition being the delta/CDC of 03/02 vs. 03/01, with only inserts. – cheapcoder Mar 01 '22 at 23:07
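The left-join comparison described in the comments above could be sketched roughly like this (table names, partition dates, and the join condition are illustrative assumptions based on the comment, not tested code):

```sql
-- Step 2 of the plan above: compare today's full-extract partition of
-- table A against yesterday's, and append only new/changed rows to B.
-- Names (my_dataset.table_a / table_b) and dates are placeholders.
INSERT INTO my_dataset.table_b
SELECT cur.*
FROM my_dataset.table_a AS cur
LEFT JOIN my_dataset.table_a AS prev
  ON  prev.Agent_Key = cur.Agent_Key
  AND prev.MD5_CD    = cur.MD5_CD
  AND DATE(prev.row_eff_ts) = '2022-03-01'
WHERE DATE(cur.row_eff_ts) = '2022-03-02'
  AND prev.Agent_Key IS NULL;  -- no match on key+md5 => new or changed
```

Note that joining on MD5_CD alone only detects changes if the hash covers the attributes being tracked; joining on the key plus the hash catches both new keys and attribute changes for existing keys.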

1 Answer


A couple comments on your question first:

  • Daily extract on 2022-03-03 shows row_eff_ts as "2022-03-02" for Mark. I'm assuming this is a typo and should be "2022-03-03".
  • I think you're misunderstanding MD5_CD. According to your link, the hash is meant to be a consistent hash of business keys, which shouldn't change based on user updates. I think Agent 12345 should still have MD5_CD="abcde".

The query below runs over the full set of daily extracts to produce all of the deltas. You can add additional WHERE clauses over row_eff_ts to restrict it to a given day, if you want to process exactly one day at a time.

-- create sample data
create temp table daily_extracts (
    Agent_Key INT64,
    Agent_name STRING,
    MD5_CD STRING,
    row_eff_ts TIMESTAMP
);

insert daily_extracts (Agent_Key, Agent_name, MD5_CD, row_eff_ts)
VALUES
(12345, "Josh", "abcde", "2022-03-01 04:14:06"),
(12345, "Josh", "abcde", "2022-03-02 04:14:06"),
(12346, "Mark", "fghij", "2022-03-02 04:14:06"),
(12345, "Josher", "mnopi", "2022-03-03 04:14:06"),
(12346, "Mark", "fghij", "2022-03-03 04:14:06");

-- Calculate previous_Agent name. This could be combined with the 
-- query below, but this makes it easier to build the query 
-- incrementally IMO, without copying/pasting SQL as much.
with daily_data_and_prev_name as (
    select 
        *, 
        lag(Agent_name) over 
            (partition by Agent_Key order by row_eff_ts) previous_Agent_name,
    from daily_extracts
)

select * except (previous_Agent_name), 
from daily_data_and_prev_name
where 
  -- This finds name changes.
  Agent_name != previous_Agent_name 
  -- This finds new records.
  or previous_Agent_name is null
order by row_eff_ts
;

Output of the last query:

Row     Agent_Key   Agent_name  MD5_CD  row_eff_ts
1       12345       Josh        abcde   2022-03-01 04:14:06 UTC
2       12346       Mark        fghij   2022-03-02 04:14:06 UTC
3       12345       Josher      mnopi   2022-03-03 04:14:06 UTC
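Once the insert-only table of deltas exists, row_end_eff_ts can be virtualized with LEAD over each key, so no rows ever need updating. A minimal sketch, assuming the placeholder names from above (my_dataset.agent_deltas is not part of the question):

```sql
-- Each row's end timestamp is the next version's start timestamp;
-- NULL marks the currently active row for that Agent_Key.
CREATE OR REPLACE VIEW my_dataset.agents_scd2 AS
SELECT
  Agent_Key,
  Agent_name,
  MD5_CD,
  row_eff_ts,
  LEAD(row_eff_ts) OVER (PARTITION BY Agent_Key ORDER BY row_eff_ts)
    AS row_end_eff_ts
FROM my_dataset.agent_deltas;
```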
Jeff