4

In my ETL process I am using Change Data Capture (CDC) to discover only rows that have been changed in the source tables since the last extraction. Then I do the transformation only for this rows. The problem is when I have for example 2 tables which I want to join into one dimension, and only one of them has changed. For example I have table Countries and Towns as following:

Countries:

ID Name
1  France

Towns:

ID Name  Country_ID
1  Lyon  1

Now lets say a new row is added to Towns table:

ID Name  Country_ID
1  Lyon  1
2  Paris 2

The Countries table has not been changed, so CDC for these tables shows me only the row from Towns table. The problem is when I do the join between Countries and Towns, there is no row in Countries change set, so the join will result in empty set.

Do you have an idea how to solve it? Of course there might be more difficult cases, involving 3 and more tables, and consequential joins.

Marek Grzenkowicz
  • 17,024
  • 9
  • 81
  • 111
rlyzwa
  • 63
  • 1
  • 5

2 Answers2

2

This is a typical problem found when doing Realtime Change-Data-Capture, or even Incremental-only daily changes.

There's multiple ways to solve this.

One way would be to do your joins on the natural keys in the dimension or mapping table, to get the associated country (SELECT distinct country_name, [..other attributes..] from dim_table where country_id = X).

Another alternative would be to do the join as part of the change capture process - when a row is loaded to towns, a trigger goes off that loads the foreign key values into the associated staging tables (country, etc).

N West
  • 6,768
  • 25
  • 40
0

There is allot i could babble on for more information on but i will be specific to what is in your question. I would suggest the following to get the results...

1st Pass is where everything matches via the join... 
Union All 
2nd Pass Gets all towns where there isn't a country
(left outer join with a where condition that 
requires the ID in the countries table to be null/missing).

You would default the Country ID value in that unmatched join to something designated as a "Unmatched Value" typically 0 or -1 is used or a series of standard -negative numbers that you could assign descriptions to later to identify why data is bad for your example -1 could be "Found Town Without Country".