I have an SCD2 dimension in our Snowflake implementation, something like this:
DEAL_SIDE_LEG_DIM
- DIM_DEAL_ID Number Surrogate PK
- DEALNO Business key
- DEALSIDE Business key
- DEALLEG Business key
- various attributes
- VALID_FROM date
- VALID_TO date
NB - this is a denormalised hierarchy of DEALNO / DEALSIDE / DEALLEG
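
To make the structure concrete, here is a rough DDL sketch of what I mean (the types, and the DEAL_STATUS attribute, are illustrative rather than the real definition):

CREATE TABLE DEAL_SIDE_LEG_DIM (
    DIM_DEAL_ID  NUMBER,    -- surrogate PK
    DEALNO       NUMBER,    -- business key, effectively a sequence
    DEALSIDE     NUMBER,    -- business key
    DEALLEG      NUMBER,    -- business key
    DEAL_STATUS  VARCHAR,   -- one of many type-2 tracked attributes (illustrative)
    VALID_FROM   DATE,
    VALID_TO     DATE       -- set to a max date while the row is 'valid'
);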
VALID_TO is set to a max date for 'valid' rows and to an end date when an old SCD2 row is closed down. It's likely to be one of the biggest dimensions in our warehouse. Deals typically go through a fairly fixed lifecycle, but can be changed at any point in that life, and the lifetime can vary greatly between deals.
This results in quite high churn, but also quite sparse churn: by that I mean updates are typically spread widely across the surrogate key and/or business keys, making it hard to leverage micropartition pruning.
The problem bites as slow UPDATE performance when significant numbers of deals have their status updated - not so much in the initial pruning, more (I think) in writing out new immutable micropartitions.
This is the problem step:

UPDATE DEAL_SIDE_LEG_DIM
SET VALID_TO = SYSDATE()
WHERE <filter_criteria>;
This is regardless of whether filter_criteria is a list of business keys or a list of surrogate keys, partly because the rows being passed in are so sparse, and also because these keys are not well clustered (high cardinality, with large clustering overlap and depth).
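
For concreteness, the two shapes of filter look roughly like this (the staging table names are illustrative, and '9999-12-31' stands in for whatever the max 'open' date is):

-- variant 1: filter by business key
UPDATE DEAL_SIDE_LEG_DIM
SET VALID_TO = SYSDATE()
FROM CHANGED_DEALS_STG
WHERE DEAL_SIDE_LEG_DIM.DEALNO = CHANGED_DEALS_STG.DEALNO
  AND DEAL_SIDE_LEG_DIM.VALID_TO = '9999-12-31';   -- only close currently-valid rows

-- variant 2: filter by a pre-resolved list of surrogate keys
UPDATE DEAL_SIDE_LEG_DIM
SET VALID_TO = SYSDATE()
FROM CHANGED_DIM_IDS_STG
WHERE DEAL_SIDE_LEG_DIM.DIM_DEAL_ID = CHANGED_DIM_IDS_STG.DIM_DEAL_ID;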
There is currently no clustering key on the table. Is there anything worth exploring to try to improve performance, or is it just that closing records in a type 2 SCD will always be costly if the update grouping is hard to predict?
I considered clustering by validity, but there is still a good chance that my update rows would be scattered amongst the 'valid' micropartitions, so it would only reduce the initial prune (not the costly part) and not improve the update cost (if I've determined correctly what is going on under the covers).
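
For completeness, what I had in mind there was simply:

-- cluster so currently-valid rows (max VALID_TO) tend to sit in the same micropartitions
ALTER TABLE DEAL_SIDE_LEG_DIM CLUSTER BY (VALID_TO);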
Another option would be to cluster by DEALNO (or a range of DEALNO), which should at least group the children of a deal together if the update applies to them all. However, the difference in deal lifespan will still mean the changed rows span a wide range of DEALNO, and again I think this leads to large numbers of micropartitions being rewritten just to set the VALID_TO end date.
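
Again, roughly this (the bucket width of 100,000 deal numbers is just an illustration):

-- bucket deals into ranges so the children of a deal land in nearby micropartitions
ALTER TABLE DEAL_SIDE_LEG_DIM CLUSTER BY (FLOOR(DEALNO / 100000));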
Maybe it points to an over-denormalised set-up that needs to be normalised back into a hierarchy of dimensions, but then I will potentially still have the high cost of sparse updates on the 'lowest' level.
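
By 'normalised back into a hierarchy' I mean something roughly like the following (names and columns are illustrative), with each level versioned independently so a change only rewrites rows at its own level:

CREATE TABLE DEAL_DIM (
    DIM_DEAL_ID NUMBER,     -- surrogate key
    DEALNO NUMBER,          -- business key
    -- deal-level attributes
    VALID_FROM DATE, VALID_TO DATE
);

CREATE TABLE DEAL_SIDE_DIM (
    DIM_DEAL_SIDE_ID NUMBER,
    DEALNO NUMBER, DEALSIDE NUMBER,
    -- side-level attributes
    VALID_FROM DATE, VALID_TO DATE
);

CREATE TABLE DEAL_LEG_DIM (
    DIM_DEAL_LEG_ID NUMBER,
    DEALNO NUMBER, DEALSIDE NUMBER, DEALLEG NUMBER,
    -- leg-level attributes
    VALID_FROM DATE, VALID_TO DATE
);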
Sizing of the table: 500 million rows, of which 60 million are currently valid, in approximately 2,000 micropartitions.
Some clarity on what I mean by sparse updates hitting all partitions:
The main business key for deals is the deal number, which is effectively a sequence. So, as you can imagine, when building this up the micropartitions are initially naturally well organised for that business key.
However, the time for a deal to change status (or for some other attribute to force an SCD2 change) is extremely variable. So we have a chunk of the order of tens of thousands of rows changing, but they are very well spread out in terms of that business key and effectively hit ALL micropartitions. Zero pruning around the business key is visible, and when VALID_TO is set to close off the old rows, every micropartition that is hit has to be rewritten (due to the immutable nature of Snowflake micropartitions) just to apply the update of that valid date. It's the latter that I believe is the most significant cost here.
To validate that: the 10,000+ rows to be updated are referenced by deal number, and if I join that list to the dimension I get the same inefficient pruning, but a CTAS of just the key and valid date on an XS warehouse comes back within a minute, whereas the update takes closer to 12 minutes - about the same time as a CTAS of the whole dimension. Hence my conclusion that the cost here is writing out new immutable micropartitions.
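
For reference, the comparison was along these lines (the staging table and CTAS target names are illustrative):

-- reads the same rows with the same poor pruning, but finishes within a minute on XS
CREATE OR REPLACE TABLE CLOSE_CANDIDATES AS
SELECT d.DIM_DEAL_ID, d.VALID_TO
FROM DEAL_SIDE_LEG_DIM d
JOIN CHANGED_DEALS_STG s
  ON s.DEALNO = d.DEALNO;

-- the equivalent update takes closer to 12 minutes, roughly the same as
-- CREATE TABLE ... AS SELECT * FROM DEAL_SIDE_LEG_DIM, i.e. rewriting the whole dimension
UPDATE DEAL_SIDE_LEG_DIM
SET VALID_TO = SYSDATE()
FROM CHANGED_DEALS_STG
WHERE DEAL_SIDE_LEG_DIM.DEALNO = CHANGED_DEALS_STG.DEALNO;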
Finally, about 10-20% of the rows are valid at any one time.