
Question

main question

How can I ephemerally materialize a slowly changing dimension type 2 table from a folder of daily extracts, where each CSV is one full extract of a table from a source system?

rationale

We're designing ephemeral data warehouses as data marts for end users that can be spun up and burned down without consequence. This requires we have all data in a lake/blob/bucket.

We're ripping daily full extracts because:

  1. we couldn't reliably extract just the changeset (for reasons out of our control), and
  2. we'd like to maintain a data lake with the "rawest" possible data.

challenge question

Is there a solution that could give me the state as of a specific date and not just the "newest" state?

existential question

Am I thinking about this completely backwards and there's a much easier way to do this?

Possible Approaches

custom dbt materialization

There's an insert_by_period dbt materialization in the dbt-utils package that I think might be exactly what I'm looking for? But I'm confused, because what I really need seems to be dbt snapshot, except:

  1. run dbt snapshot for each file incrementally, all at once; and,
  2. built directly off of an external table? (Roughly what I'm imagining is sketched below.)
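
For concreteness, here's roughly what I'm imagining (names are made up; `source('lake', 'crm_extracts')` is assumed to be an external table over the folder of daily CSVs, and as written this only captures state at each `dbt snapshot` run rather than replaying one run per historical file, which is exactly where I'm stuck):

```sql
-- snapshots/crm_opportunity_snapshot.sql  (illustrative sketch only)
{% snapshot crm_opportunity_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='OppId',
      strategy='check',
      check_cols=['Stage', 'Won', 'LastModified']
    )
}}

-- assumed source: an external table over the folder of daily CSV extracts;
-- with every extract date in one table this would need filtering down to a
-- single extract per run so that OppId stays unique
select * from {{ source('lake', 'crm_extracts') }}

{% endsnapshot %}
```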

Delta Lake

I don't know much about Databricks's Delta Lake, but it seems like it should be possible with Delta Tables?
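
Maybe something along these lines? (Untested sketch with made-up names: a daily MERGE keeps only the current state, but Delta's time travel might cover the "state as of a specific date" question, at least within whatever log retention is configured.)

```sql
-- Spark/Delta SQL sketch; table names are made up
-- upsert one day's extract into a Delta table keyed on OppId
MERGE INTO crm_opportunity AS t
USING crm_extract_2020_09_15 AS s
  ON t.OppId = s.OppId
WHEN MATCHED AND t.LastModified < s.LastModified THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- then the state as of a given date could come from time travel
SELECT * FROM crm_opportunity TIMESTAMP AS OF '2020-09-15';
```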

Fix the extraction job

Is our problem solved if we can make our extracts contain only what has changed since the previous extract?

Example

Suppose the following three files are in a folder of a data lake. (Gist with the 3 CSVs and the desired table outcome as CSV.) I added the Extracted column in case parsing the timestamp from the filename is too tricky.

2020-09-14_CRM_extract.csv

| OppId | CustId | Stage       | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1     | A      | 2 - Qualify |     | 9/1          | 9/14      |
| 2     | B      | 3 - Propose |     | 9/12         | 9/14      |

2020-09-15_CRM_extract.csv

| OppId | CustId | Stage       | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1     | A      | 2 - Qualify |     | 9/1          | 9/15      |
| 2     | B      | 4 - Closed  | Y   | 9/14         | 9/15      |
| 3     | C      | 1 - Lead    |     | 9/14         | 9/15      |

2020-09-16_CRM_extract.csv

| OppId | CustId | Stage       | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1     | A      | 2 - Qualify |     | 9/1          | 9/16      |
| 2     | B      | 4 - Closed  | Y   | 9/14         | 9/16      |
| 3     | C      | 2 - Qualify |     | 9/15         | 9/16      |

End Result

Below is the SCD-II output for the three files as of 9/16. The SCD-II as of 9/15 would be the same, except OppId=3 would have only one row, with valid_from=9/15 and valid_to=null.

| OppId | CustId | Stage       | Won | LastModified | valid_from | valid_to |
|-------|--------|-------------|-----|--------------|------------|----------|
| 1     | A      | 2 - Qualify |     | 9/1          | 9/14       | null     |
| 2     | B      | 3 - Propose |     | 9/12         | 9/14       | 9/15     |
| 2     | B      | 4 - Closed  | Y   | 9/14         | 9/15       | null     |
| 3     | C      | 1 - Lead    |     | 9/14         | 9/15       | 9/16     |
| 3     | C      | 2 - Qualify |     | 9/15         | 9/16       | null     |
Anders Swanson
  • Is there a retention period that you'd like to maintain this historical data? You could easily leverage time-travel in Snowflake for this if your retention is less than 90 days, which I realize isn't common. – Mike Walton Sep 17 '20 at 02:05
  • @MikeWalton great question. While I’m sure some scenarios are only for 90 days, most stakeholders require 2 years for year-over-year KPIs, and the greedy ML team wants all history for all time (lol). – Anders Swanson Sep 17 '20 at 02:12
  • Pretty typical. I've also seen a rolling 13-month retention for year-over-year done each month, but same deal. – Mike Walton Sep 17 '20 at 02:47
  • Hi - could you clarify what you mean by "ephemeral data warehouse"? I can understand why you might generate an "ephemeral data mart" off a permanent data warehouse but I've never come across the concept of an ephemeral data warehouse. If you are "just" trying to build a data warehouse then the situation you describe is a standard pattern: you compare your source file with your dimension table and insert records that don't exist and update those that do. Given that this is such a common scenario, I'm guessing I'm missing what your actual issue is? If so, could you clarify please? – NickW Sep 17 '20 at 09:02
  • FYI - it is "best practice" to set your valid_to values to a date far in the future rather than to nulls. This makes querying easier/more performant as you can use "where date between valid_from and valid_to" rather than "where date between valid_from and valid_to or (date >= valid_from and valid_to is null)" – NickW Sep 17 '20 at 09:06
  • @NickW the idea is -- what if, instead of maintaining a core data warehouse and then providing downstream marts off the DW, we maintain all source data in a lake, and have all the code that spins up a DW or DB when the stakeholders need it. In this way, we have a DW as code that can be deleted, updated, and recreated without the complexity that goes along with traditional DW change management. – Anders Swanson Sep 17 '20 at 19:51
  • My response is too long for a comment so adding it as an answer... – NickW Sep 18 '20 at 09:40

2 Answers


Interesting concept and, of course, it would be a longer conversation than is possible in this forum to fully understand your business, stakeholders, data, etc. I can see that it might work if you had a relatively small volume of data, your source systems rarely changed, your reporting requirements (and hence, datamarts) also rarely changed and you only needed to spin up these datamarts very infrequently.

My concerns would be:

  1. If your source or target requirements change how are you going to handle this? You will need to spin up your datamart, do full regression testing on it, apply your changes and then test them. If you do this as/when the changes are known then it's a lot of effort for a Datamart that's not being used - especially if you need to do this multiple times between uses; if you do this when the datamart is needed then you're not meeting your objective of having the datamart available for "instant" use.

I'm not sure your statement "we have a DW as code that can be deleted, updated, and recreated without the complexity that goes along with traditional DW change management" is true. How are you going to test updates to your code without spinning up the datamart(s) and going through a standard test cycle with data - and then how is this different from traditional DW change management?

  2. What happens if there is corrupt/unexpected data in your source systems? In a "normal" DW where you are loading data daily this would normally be noticed and fixed on the day. In your solution the dodgy data might have occurred days/weeks ago and, assuming it loaded into your datamart rather than erroring on load, you would need processes in place to spot it and then potentially have to unravel days of SCD records to fix the problem.
  3. (Only relevant if you have a significant volume of data) Given the low cost of storage, I'm not sure I see the benefit of spinning up a datamart when needed as opposed to just holding the data so it's ready for use. Loading large volumes of data every time you spin up a datamart is going to be time-consuming and expensive. A possible hybrid approach might be to only run incremental loads when the datamart is needed rather than running them every day - so you have the data from when the datamart was last used ready to go at all times and you just add the records created/updated since the last load.
NickW

I don't know whether this is the best or not, but I've seen it done. When you build your initial SCD-II table, add a column that is a stored HASH() value of all of the values of the record (you can exclude the primary key). Then, you can create an External Table over your incoming full data set each day, which includes the same HASH() function. Now, you can execute a MERGE or INSERT/UPDATE against your SCD-II based on primary key and whether the HASH value has changed.

The main advantage of doing things this way is that you avoid loading all of the data into Snowflake each day to do the comparison, but it will be slower to execute. You could also load to a temp table with the HASH() function included in your COPY INTO statement, then update your SCD-II and drop the temp table, which could actually be faster.
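
A rough sketch of what such a MERGE might look like (object names are illustrative; it assumes the external table exposes named virtual columns, and note that a single statement like this closes out changed rows and inserts brand-new keys, so the new versions of changed rows still need a follow-up insert):

```sql
-- illustrative names: CRM_SCD2 is the SCD-II table (with a ROW_HASH column),
-- CRM_EXT is an external table over today's extract
MERGE INTO crm_scd2 t
USING (
    SELECT OppId, CustId, Stage, Won, LastModified, Extracted,
           HASH(CustId, Stage, Won, LastModified) AS row_hash
    FROM crm_ext
) s
    ON t.OppId = s.OppId AND t.valid_to IS NULL
WHEN MATCHED AND t.row_hash <> s.row_hash THEN
    UPDATE SET valid_to = s.Extracted          -- close out the superseded version
WHEN NOT MATCHED THEN
    INSERT (OppId, CustId, Stage, Won, LastModified, valid_from, valid_to, row_hash)
    VALUES (s.OppId, s.CustId, s.Stage, s.Won, s.LastModified, s.Extracted, NULL, s.row_hash);
```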

Mike Walton
  • totally. but let's say this SCD-II process was running for a few months, but then gets deleted overnight. How could I recreate it exactly as it was before it got deleted, but referencing the source files? My idea is to loop through each csv chronologically and run an update on each file? My concern is that seems to be a heavy load and not conducive to parallelization. – Anders Swanson Sep 17 '20 at 19:47
  • No, THIS is where your time-travel in Snowflake comes into play. Just recreate the table using time-travel at the point where the table was deleted. – Mike Walton Sep 17 '20 at 23:30
  • Now, if you wanted a purely hypothetical rebuild, you could use an external table over every file you have and then use window functions and a hash to recreate your table. This can be done in a single statement, window functions to choose records where the hash is different from the current hash using `LAG` and then choose the `LAG` value for your valid_to date. – Mike Walton Sep 17 '20 at 23:33
  • yeah so time-travel is off the table bc we need to maintain years of history. the external table per file idea is interesting. I think I could use `dbt` to make this happen imperatively with Jinja – Anders Swanson Sep 18 '20 at 02:18
  • When I mentioned time-travel, I was referring to "recovering from a case where your table got deleted". Rather than rebuild the whole table, you could just use time-travel to recover from that deletion. – Mike Walton Sep 19 '20 at 15:12
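
For what it's worth, the single-statement rebuild described in that last exchange might look roughly like this (Snowflake-flavored, illustrative names; it uses LEAD rather than LAG to derive valid_to, which is the same idea with the ordering flipped, and assumes an external table CRM_ALL_EXTRACTS over every file with Extracted exposed as a column):

```sql
WITH hashed AS (
    SELECT OppId, CustId, Stage, Won, LastModified, Extracted,
           HASH(CustId, Stage, Won, LastModified) AS row_hash
    FROM crm_all_extracts
),
changed AS (
    -- keep only rows whose hash differs from the previous extract's row for the same key
    SELECT *
    FROM hashed
    QUALIFY row_hash IS DISTINCT FROM
            LAG(row_hash) OVER (PARTITION BY OppId ORDER BY Extracted)
)
SELECT OppId, CustId, Stage, Won, LastModified,
       Extracted AS valid_from,
       -- the next version's start date becomes this version's end date
       LEAD(Extracted) OVER (PARTITION BY OppId ORDER BY Extracted) AS valid_to
FROM changed
ORDER BY OppId, valid_from;
```

Against the three example files in the question, a query along these lines should reproduce the desired end-result table.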