I have a table clustered on s_nation_key as below.
create or replace table t1
( S_SUPPKEY string,
S_NAME string,
S_NATIONKEY string,
S_ADDRESS string,
S_ACCTBAL string) cluster by (S_NATIONKEY);
Now i have added data to it
INSERT INTO T1
SELECT S_SUPPKEY , S_NAME,S_NATIONKEY,S_ADDRESS,S_ACCTBAL
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1000"."SUPPLIER"
WHERE S_NATIONKEY=7
limit 50000;
When i check data distribution in underlying micro partition itlooks good .
>select system$clustering_information('t1','S_NATIONKEY');
{ "cluster_by_keys" : "LINEAR(S_NATIONKEY)", "total_partition_count" : 1, "total_constant_partition_count" : 0, "average_overlaps" : 0.0, "average_depth" : 1.0, "partition_depth_histogram" : {
"00000" : 0,
"00001" : 1,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0 } }
Again i have loaded few more record as below for particular s_nation_key set as below.
--batch load 2
INSERT INTO T1
SELECT S_SUPPKEY , S_NAME,S_NATIONKEY,S_ADDRESS,S_ACCTBAL
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1000"."SUPPLIER"
WHERE S_NATIONKEY=3
LIMIT 50000;
--batch load 3
INSERT INTO T1
SELECT S_SUPPKEY , S_NAME,S_NATIONKEY,S_ADDRESS,S_ACCTBAL
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1000"."SUPPLIER"
WHERE S_NATIONKEY=1
limit 50000;
--batch load 3
INSERT INTO T1
SELECT S_SUPPKEY , S_NAME,S_NATIONKEY,S_ADDRESS,S_ACCTBAL
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1000"."SUPPLIER"
WHERE S_NATIONKEY=2
and S_ACCTBAL>0
limit 50000;
Now when i check clustering information again ,this also looks good . Now total 4 micro-partition and each distinct S_NATIONKEY value set is loaded into individual partition with no overlapping in range.So all micro-partition is having clustering depth 1.
>select system$clustering_information('t1','S_NATIONKEY');
{
"cluster_by_keys" : "LINEAR(S_NATIONKEY)",
"total_partition_count" : 4,
"total_constant_partition_count" : 4,
"average_overlaps" : 0.0,
"average_depth" : 1.0,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 4,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0
}
}
Now as per Snowflake documentation and concept of query pruning, when ever we search for records belong to one cluster_key value , it should scan only particular micro-partition which will be holding that cluster_key value (basing on min/max value range of each micro-partition). But in my case it is scanning all underlying micro partition(as below)
. As per above query planning stats,it is scanning all the partitions, instead of scanning 1 .
Am i missing anything here ??What is the logic behind it ?? Please help me in understanding this scenario in Snowflake.
Thanks, @Himanshu