1

What is the best approach for clustering snowflake tables

  1. Absolute clustering by manually reloading the tables at a certain frequency based on retrieval order
  2. Create cluster key and turn on auto recluster but suspend it most of them, run it only at certain intervals may be by looking at the partition scanned column of the table

Thanks Rajib

Rajib Deb
  • 1,496
  • 11
  • 30

3 Answers3

1

There is not general across all data use patterns that applies, and also that applies across time, as clustering that the implementation is evolving (said as an outside, but watching it change over time).

Auto clustering is just like hard drive fragmentation management. Because they are both the same idea, of locating like data near, to make read perf better. And just like disk defragmentation different usage loads/patterns make the need for clstuering/defrag more important, and some usages conflict with auto-clustering.

For example we have some tables that are written in as tight a loop as we can, and we want it clustered in a pattern that is 90% aligned with the insert order. So the auto clustering is not costly to the insert pattern. But once a month we delete from these tables GDPR/PII reasons, and after update/delete change 1/3 of partitions. So it would seem doing a full table rewrite with an ORDER applied would be overkill. But because of the insert rate auto-clustering (as it stands today) thrashes for hours and costs 5x the cost to do a full table rewrite.

Also we have other tables (the contain address information) and the table is "rather small" so is full tables scanned a lot, so ordering it in the sense of auto-cluster does not make sense, but re-build the table daily, to keep the partition size small as possible, so full tables scans are the fastest they can be.. the point being auto-clustering also does micro-partition optimization, which would be useful, but we don't need the table ordered, so are not running clustering..

Simeon Pilgrim
  • 22,906
  • 3
  • 32
  • 45
0

Your best method is to create the initial table sorted by your cluster key, and then turn on autoclustering...and then let Snowflake handle everything for you from there.

Mike Walton
  • 6,595
  • 2
  • 11
  • 22
  • that depends on how you alter the table. There are cases where autoclustering will thrash as cost more that just full tables rebuilds – Simeon Pilgrim Apr 23 '20 at 01:34
  • "Thrashing" has been reduced quite a bit with Snowflake's optimizations of the autoclustering service. You are essentially paying for the compute to rebuild only the micropartitions that need to be rebuilt vs. the amount of compute it takes to rebuild and reorder the full table. Worth testing either way for the table that you are using. If you do rebuild, I recommend the `INSERT OVERWRITE` method to rewrite the table with no downtime on that table. – Mike Walton Apr 23 '20 at 02:39
  • My "example" is within the last month recent, and I was talking to Andrew Meyendorff last week about it, so I am quite sure. it's a thing.. – Simeon Pilgrim Apr 23 '20 at 03:00
  • Never said it wasn't. Just said it had been reduced quite a bit. – Mike Walton Apr 23 '20 at 12:56
0

To cut the chase for the answers.

  1. Load the tables with sorted data/time field - which might be used to retrieve the data - Business date instead of (ETL) insert date/time. This should be good enough for most of the tables from the data retrieval performance point of view.

  2. You can choose to do re-clustering depending upon the rate of DML operation on the table

  3. Given you have an additional pattern for data access on the specific columns - you may consider adding clustering keys to the table - and let the auto clustering kick in.

It is always desirable to identify the access pattern sooner than later. Given that, to make sure you achieve performance data retrieval - auto clustering will re-arrange the data.

Auto - clustering will cost you credits but that will outplay for the performance that you will achieve.

Link here will help you make an informed decision.

Hope this helps!

FKayani
  • 981
  • 1
  • 5
  • 10