
I have an SCD2 dimension in our Snowflake implementation, something like this:

DEAL_SIDE_LEG_DIM

  • DIM_DEAL_ID Number Surrogate PK
  • DEALNO Business key
  • DEALSIDE Business key
  • DEALLEG Business key
  • various attributes
  • VALID_FROM date
  • VALID_TO date

NB - this is a denormalised hierarchy of DEALNO / DEALSIDE / DEALLEG

Where VALID_TO is set to the max date for 'valid' rows and set to an end date when an old SCD2 row is closed down. It's likely to be one of the biggest dimensions in our warehouse, and deals typically go through a fairly fixed lifecycle, but can be changed at any point in that life, and the lifetime can vary greatly between deals.
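For concreteness, a minimal sketch of the structure described above (the column types, the primary key constraint and the '9999-12-31' max-date sentinel are assumptions, not the actual DDL):

-- Sketch only: types and the max-date sentinel are assumed
CREATE TABLE DEAL_SIDE_LEG_DIM (
    DIM_DEAL_ID  NUMBER NOT NULL,   -- surrogate PK
    DEALNO       NUMBER NOT NULL,   -- business key (effectively a sequence)
    DEALSIDE     NUMBER NOT NULL,   -- business key
    DEALLEG      NUMBER NOT NULL,   -- business key
    -- ... various attributes ...
    VALID_FROM   DATE   NOT NULL,
    VALID_TO     DATE   NOT NULL DEFAULT '9999-12-31'::DATE,
    CONSTRAINT PK_DEAL_SIDE_LEG_DIM PRIMARY KEY (DIM_DEAL_ID)
);

-- Current-version lookups rely on the max-date convention
SELECT *
FROM DEAL_SIDE_LEG_DIM
WHERE VALID_TO = '9999-12-31'::DATE;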

This results in quite high churn, but also in quite sparse churn; by that I mean updates are typically spread widely across the surrogate key and/or business keys, making it hard to leverage micro-partition pruning.

The problem bites in the form of slow UPDATE performance when significant numbers of deals are updated in status - not so much in the initial pruning, more, I think, in writing out new immutable micro-partitions.

This is the problem step

UPDATE DEAL_SIDE_LEG_DIM
SET VALID_TO = SYSDATE()
WHERE filter_criteria;

This is regardless of whether filter_criteria is a list of business keys or a list of surrogate keys, partly because the rows being passed in are so sparse, and also because these keys are not well clustered (high cardinality, with large overlap and depth).
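For illustration, the two filter shapes I have tried look roughly like this (CHANGED_DEALS_STG is a hypothetical staging table holding the keys of the deals being closed off):

-- Variant 1: filter on surrogate keys
UPDATE DEAL_SIDE_LEG_DIM
SET VALID_TO = SYSDATE()
WHERE VALID_TO = '9999-12-31'::DATE
  AND DIM_DEAL_ID IN (SELECT DIM_DEAL_ID FROM CHANGED_DEALS_STG);

-- Variant 2: filter on business keys via the FROM clause
UPDATE DEAL_SIDE_LEG_DIM
SET VALID_TO = SYSDATE()
FROM CHANGED_DEALS_STG s
WHERE DEAL_SIDE_LEG_DIM.DEALNO   = s.DEALNO
  AND DEAL_SIDE_LEG_DIM.DEALSIDE = s.DEALSIDE
  AND DEAL_SIDE_LEG_DIM.DEALLEG  = s.DEALLEG
  AND DEAL_SIDE_LEG_DIM.VALID_TO = '9999-12-31'::DATE;

Both variants show the same behaviour: poor pruning on the scan and, more significantly, a large rewrite of micro-partitions on the write.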

There is currently no clustering on the table; is there anything worth exploring to try to improve performance? Or is it just that closing records in a type 2 SCD will always be costly if the update grouping is hard to predict?
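If explicit clustering is worth a trial, the mechanics are just declaring a clustering key and letting automatic clustering reorganise the table in the background; the key below is a placeholder, not a recommendation:

-- Declare a clustering key (the expression is a placeholder)
ALTER TABLE DEAL_SIDE_LEG_DIM CLUSTER BY (DEALNO);

-- And drop it again if the reclustering credits outweigh the benefit
ALTER TABLE DEAL_SIDE_LEG_DIM DROP CLUSTERING KEY;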

I considered using the validity dates to cluster by, but there is still a good chance that my update rows would be scattered amongst the 'valid' micro-partitions, so it would only reduce the initial prune (not the costly part) and not improve the update cost (if I've correctly determined what is going on under the covers).

Another option would be to cluster by DEALNO (or a range of DEALNO), which should at least group the children of a deal together if the update applies to them all. However, the variation in deal lifespan will still mean a wide range of DEALNOs in each update batch, and again I think this leads to large numbers of micro-partitions being rewritten just to set the VALID_TO end date.
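A sketch of that option, bucketing DEALNO so that all sides and legs of nearby deals land in the same micro-partitions (the bucket width is an arbitrary assumption that would need tuning):

-- Cluster on a coarse DEALNO bucket
ALTER TABLE DEAL_SIDE_LEG_DIM CLUSTER BY (FLOOR(DEALNO / 100000));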

Maybe it points to an over-denormalised set-up that needs to be normalised back into a hierarchy of dimensions, but then I will potentially still have the high cost of sparse updates on the 'lowest' level.
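For completeness, the normalised alternative I have in mind would look something like the sketch below (table and column names are illustrative only); the VALID_TO churn would then concentrate on the lowest level:

-- Illustrative only: splitting the denormalised dimension back into a hierarchy
CREATE TABLE DEAL_DIM (
    DIM_DEAL_ID NUMBER NOT NULL,       -- surrogate PK
    DEALNO      NUMBER NOT NULL,       -- business key
    VALID_FROM  DATE   NOT NULL,
    VALID_TO    DATE   NOT NULL
);

CREATE TABLE DEAL_SIDE_DIM (
    DIM_DEAL_SIDE_ID NUMBER NOT NULL,  -- surrogate PK
    DIM_DEAL_ID      NUMBER NOT NULL,  -- reference to DEAL_DIM
    DEALSIDE         NUMBER NOT NULL,  -- business key
    VALID_FROM       DATE   NOT NULL,
    VALID_TO         DATE   NOT NULL
);

CREATE TABLE DEAL_LEG_DIM (
    DIM_DEAL_LEG_ID  NUMBER NOT NULL,  -- surrogate PK
    DIM_DEAL_SIDE_ID NUMBER NOT NULL,  -- reference to DEAL_SIDE_DIM
    DEALLEG          NUMBER NOT NULL,  -- business key
    VALID_FROM       DATE   NOT NULL,
    VALID_TO         DATE   NOT NULL
);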


Sizing of the table: 500 million rows, of which 60 million are currently valid, in approximately 2,000 micro-partitions.
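For reference, the micro-partition count and the overlap/depth figures mentioned above can be pulled from Snowflake's clustering metadata, for example:

-- Returns JSON including total_partition_count, average_overlaps and average_depth
-- for the table as if it were clustered on the listed columns
SELECT SYSTEM$CLUSTERING_INFORMATION('DEAL_SIDE_LEG_DIM', '(DEALNO, DEALSIDE, DEALLEG)');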


Some clarity on what I mean by sparse updates hitting all partitions:

The main business key for deals is the deal number, which is effectively a sequence. So, as you can imagine, when building this up the micro-partitions are initially naturally well organised for that business key.

However, the time for a deal to change status, or for some other attribute to force an SCD change, is extremely variable. Hence we have a chunk on the order of tens of thousands of rows changing, but they are very well spread out in terms of that business key and effectively hit ALL micro-partitions. Zero pruning around the business key is visible; moreover, when VALID_TO is set to close off old rows, every micro-partition that is hit has to be rewritten (due to the immutable nature of Snowflake micro-partitions) to cope with the update of that valid date, and it is the latter that I believe to be the most significant cost here.

To validate that: the 10,000+ rows updated are referenced by deal number, and if I join that set with the dimension I get the same inefficient pruning, but a CTAS of just the key and valid date comes back within a minute on an XS warehouse, whereas the update takes closer to 12 minutes - about the same time as a CTAS of the whole dimension. Hence my conclusion that the cost here is writing out new immutable micro-partitions.
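The comparison I ran looked roughly like this (CHANGED_DEALS_STG is the same hypothetical staging table as above, and the CTAS target name is made up):

-- CTAS of just the keys and validity dates for the affected deals:
-- the join shows the same poor pruning, yet this completes within a minute on an XS warehouse
CREATE OR REPLACE TABLE TMP_AFFECTED_ROWS AS
SELECT d.DIM_DEAL_ID, d.DEALNO, d.VALID_FROM, d.VALID_TO
FROM DEAL_SIDE_LEG_DIM d
JOIN CHANGED_DEALS_STG s
  ON s.DEALNO = d.DEALNO
WHERE d.VALID_TO = '9999-12-31'::DATE;

-- The equivalent UPDATE takes closer to 12 minutes, roughly the time of a CTAS of the whole dimension
UPDATE DEAL_SIDE_LEG_DIM
SET VALID_TO = SYSDATE()
FROM CHANGED_DEALS_STG s
WHERE DEAL_SIDE_LEG_DIM.DEALNO = s.DEALNO
  AND DEAL_SIDE_LEG_DIM.VALID_TO = '9999-12-31'::DATE;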


Finally, about 10-20% of the rows are valid at any one time.

  • Can you please give us more information, like the number of rows being updated and, if possible, the profile of the query? Generally the cluster column in Snowflake is decided by Snowflake; we change it if the query is running very slow and the table size is more than 1 TB. (Please let us know the table size.) – Himanshu Kandpal Oct 12 '21 at 16:58
  • Do you need VALID_TO as a physical column in the table? Since update operations tend to be costly in any database, an alternate approach is to store only VALID_FROM dates. Then you create a view on top of the table where you calculate VALID_TO dates using a window function such as LEAD() (sketched after these comments). Not sure if this applies in your situation, but just an idea to consider. – Maja F. Oct 12 '21 at 18:17
  • @hkandpal Snowflake doesn't decide on a cluster column. It's either not clustered (micro-partitions are created as data is loaded without any clustering) or the table is explicitly clustered by something for performance. And general guidance is 1TB or larger, but there are many good reasons to implement clustering on smaller tables. – Mike Walton Oct 12 '21 at 21:50
  • @MajaF. An interesting idea, but as it's a commonly used dimension I think any saving in the cost of the update would probably be outweighed by the extra lookup cost when hooking the last valid version up to facts each time. – Matt Prince Oct 14 '21 at 13:34
  • @hkandpal & Mike - it's at nearly 500 million rows over about 2,400 micro-partitions. The query profile was simple: it showed a join without any pruning, then the 15,000 or so rows going through to the update, but this is crucial - 99% of the cost was in the update; very, very little was in the join despite its lack of efficiency. – Matt Prince Oct 14 '21 at 13:44
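A minimal sketch of the VALID_FROM-only idea suggested in the comments above, deriving VALID_TO with LEAD() in a view (the view name and the max-date default are assumptions):

-- Store only VALID_FROM physically; derive VALID_TO per business key in a view
CREATE OR REPLACE VIEW DEAL_SIDE_LEG_DIM_V AS
SELECT
    DIM_DEAL_ID,
    DEALNO,
    DEALSIDE,
    DEALLEG,
    VALID_FROM,
    LEAD(VALID_FROM, 1, '9999-12-31'::DATE)
        OVER (PARTITION BY DEALNO, DEALSIDE, DEALLEG ORDER BY VALID_FROM) AS VALID_TO
FROM DEAL_SIDE_LEG_DIM;

As noted in the reply above, the trade-off is paying the window-function cost on every read of a heavily used dimension rather than once at load time.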

1 Answer


Keep in mind that Snowflake doesn't "update"; it creates new micro-partitions for the records that are being updated. If your table isn't clustered, the costly part of this is most likely the initial scan rather than the resulting write. I would consider clustering on whatever it is you are joining on to update the dimension, which may be your 3 business keys? If so, use some sort of concatenation (or HASH) and substring of those to group records together. This way, the initial scan is pruned efficiently AND the resulting write won't need to be re-clustered very much, either.
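A sketch of that suggestion, grouping records by a hash or a substring of the concatenated business keys (the exact expressions and the bucket count are assumptions to be tuned):

-- Option A: bucketed hash of the three business keys
ALTER TABLE DEAL_SIDE_LEG_DIM CLUSTER BY (HASH(DEALNO, DEALSIDE, DEALLEG) % 1000);

-- Option B: leading substring of the concatenated keys
ALTER TABLE DEAL_SIDE_LEG_DIM
  CLUSTER BY (SUBSTR(DEALNO::VARCHAR || '-' || DEALSIDE::VARCHAR || '-' || DEALLEG::VARCHAR, 1, 6));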

Mike Walton
  • Thanks for taking the time to look at this Mike, but I'm sure it's not the initial scan. I created a CTAS using the two tables with an almost identical plan up to the join, including zero pruning, and this came back within a minute on an XS warehouse. I am pretty certain it's because of the immutability of the micro-partitions and the sparseness of my updates hitting pretty much all of them, forcing Snowflake to rewrite all of them for a change to a tiny percentage of the columns. I'm wondering if clustering by a function of the validity might be an answer. – Matt Prince Oct 14 '21 at 13:54
  • If you partition on the primary key of the table, then only those micro-partitions will be scanned and re-written. How big is your table? – Mike Walton Oct 14 '21 at 14:01
  • I have 500M rows in the table, and the micro-partitions started roughly in line with the surrogate primary key, as that is the order in which rows are inserted. However, my tens of thousands of update rows are very sparse and therefore hit virtually all the partitions. No pruning is one cost, but rewriting every affected micro-partition is the big one. – Matt Prince Oct 15 '21 at 07:47
  • I am getting promising results with clustering by a function on the VALID_TO which returns YYYYMM (sketched below). This puts all valid rows into separate micro-partitions, which are a smaller fraction of the overall history. It also means that the partitions rewritten are smaller and well clustered (as I update one set and then insert fresh rows, both with different cluster values). Also good for historic matching of dimensions at a point in time. – Matt Prince Oct 15 '21 at 07:49
  • That's probably a good solution, as I would think it would help the performance of querying the dimension as well. – Mike Walton Oct 15 '21 at 14:44
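For reference, the clustering described in the comments above could be expressed something like this (the exact function is an assumption based on the "returns YYYYMM" description):

-- Cluster on the month of VALID_TO: all currently valid rows (max-date sentinel) fall into
-- one bucket, separate from the closed-off history
ALTER TABLE DEAL_SIDE_LEG_DIM CLUSTER BY (TO_CHAR(VALID_TO, 'YYYYMM'));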