How clustering is helping in query pruning in Snowflake?

Question

I have a table clustered on s_nation_key as below.

create or replace table t1 
( S_SUPPKEY string,
  S_NAME string,
  S_NATIONKEY string,
  S_ADDRESS string,
  S_ACCTBAL string) cluster by (S_NATIONKEY);

Now i have added data to it

INSERT INTO  T1
SELECT S_SUPPKEY , S_NAME,S_NATIONKEY,S_ADDRESS,S_ACCTBAL
  FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1000"."SUPPLIER"
 WHERE S_NATIONKEY=7
limit 50000;

When i check data distribution in underlying micro partition itlooks good .

>select system$clustering_information('t1','S_NATIONKEY'); 
{   "cluster_by_keys" : "LINEAR(S_NATIONKEY)",   "total_partition_count" : 1,   "total_constant_partition_count" : 0,   "average_overlaps" : 0.0, "average_depth" : 1.0,   "partition_depth_histogram" : {
        "00000" : 0,
        "00001" : 1,
        "00002" : 0,
        "00003" : 0,
        "00004" : 0,
        "00005" : 0,
        "00006" : 0,
        "00007" : 0,
        "00008" : 0,
        "00009" : 0,
        "00010" : 0,
        "00011" : 0,
        "00012" : 0,
        "00013" : 0,
        "00014" : 0,
        "00015" : 0,
        "00016" : 0   } }

Again i have loaded few more record as below for particular s_nation_key set as below.

--batch load 2
INSERT INTO  T1
SELECT S_SUPPKEY , S_NAME,S_NATIONKEY,S_ADDRESS,S_ACCTBAL
  FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1000"."SUPPLIER"
 WHERE S_NATIONKEY=3
 LIMIT 50000;

--batch load 3
INSERT INTO  T1
SELECT S_SUPPKEY , S_NAME,S_NATIONKEY,S_ADDRESS,S_ACCTBAL
  FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1000"."SUPPLIER"
 WHERE S_NATIONKEY=1
limit 50000;

--batch load 3
INSERT INTO  T1
SELECT S_SUPPKEY , S_NAME,S_NATIONKEY,S_ADDRESS,S_ACCTBAL
  FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1000"."SUPPLIER"
 WHERE S_NATIONKEY=2
 and   S_ACCTBAL>0
limit 50000;

Now when i check clustering information again ,this also looks good . Now total 4 micro-partition and each distinct S_NATIONKEY value set is loaded into individual partition with no overlapping in range.So all micro-partition is having clustering depth 1.

>select system$clustering_information('t1','S_NATIONKEY');
{
  "cluster_by_keys" : "LINEAR(S_NATIONKEY)",
  "total_partition_count" : 4,
  "total_constant_partition_count" : 4,
  "average_overlaps" : 0.0,
  "average_depth" : 1.0,
  "partition_depth_histogram" : {
    "00000" : 0,
    "00001" : 4,
    "00002" : 0,
    "00003" : 0,
    "00004" : 0,
    "00005" : 0,
    "00006" : 0,
    "00007" : 0,
    "00008" : 0,
    "00009" : 0,
    "00010" : 0,
    "00011" : 0,
    "00012" : 0,
    "00013" : 0,
    "00014" : 0,
    "00015" : 0,
    "00016" : 0
  }
}

Now as per Snowflake documentation and concept of query pruning, when ever we search for records belong to one cluster_key value , it should scan only particular micro-partition which will be holding that cluster_key value (basing on min/max value range of each micro-partition). But in my case it is scanning all underlying micro partition(as below)

. As per above query planning stats,it is scanning all the partitions, instead of scanning 1 .

Am i missing anything here ??What is the logic behind it ?? Please help me in understanding this scenario in Snowflake.

Thanks, @Himanshu

score 2 · Answer 1 · answered May 21 '20 at 07:04

The Autoclustering or the clustering keys are not intended for all tables. It is usually suggested for a very large table that runs into Terra bytes in size. We should not compare the cluster key to any index kind of object that is available in most of the RDBMS systems. Here we are grouping the data into the micro partitions in an orderly fashion which helps to avoid scanning the partitions which may not contain the requested data. In the case of small tables, the engine prefers to scan all the partitions if it estimates that this is not a costly operation.

Refer to the Attention Section of the documentation :

https://docs.snowflake.com/en/user-guide/tables-clustering-keys.html#clustering-keys-clustered-tables.

Ravindra Mishra · Accepted Answer · 2020-05-25T20:05:12.050

1

Here the size of the table is not that big, that is why it is scanning all the partition rather one. Even if you check the total size scanned it is just 7.96 mb which is small hence SF scans all partitions

edited May 25 '20 at 20:05

answered May 25 '20 at 19:44

Ravindra Mishra

56
6

How clustering is helping in query pruning in Snowflake?

2 Answers2