Snowflake query engine strategy on several with query conditions

Question

I am doing a migration job from pyspark query into snowflake query and wondering which option is better between below A, B options.

To avoid unnecessary query, I would like to go B option if there are not that significant performance difference.

In B option, does snowflake query engine automatically optimize and internally behaves like the A option?

A option

With A1 AS (select * from a1 where date='2021-10-20'),
A2 AS (select * from a2 where date='2021-10-20'),
A3 AS (select * from a3 where date='2021-10-20'),
A4 AS (select * from a4 where date='2021-10-20'),
A5 AS (select * from a5 where date='2021-10-20')
SELECT *
FROM final_merged_table

and B option

With A1 AS (select * from a1),
A2 AS (select * from a2),
A3 AS (select * from a3),
A4 AS (select * from a4),
A5 AS (select * from a5)
SELECT *
FROM final_merged_table
WHERE date = '2021-10-20'

Is it safe to assume your CTE should be reading from the previous table expression each time and the "final_merged_table" should be A5? — Greg Pavlik, Oct 20 '21 at 03:28
In real code, CTE depends each other several times such as A3 is a results of A1, A2 join and A5 is a results of A3, A4 join. But for simplicity, it would be okay to assume that final _merged_table is an union table of all A1~A5. — Jaeyoung Jang, Oct 20 '21 at 03:59

score 2 · Accepted Answer · answered Oct 20 '21 at 06:47

We can test this. First, let's construct a table with a week of dates and several million rows:

create or replace table one_week2
as
select '2020-04-01'::date + (7*seq8()/100000000)::int day, random() data, random() data2, random() data3
from table(generator(rowcount => 100000000))

Now we can write both queries to go over this table:

Option 1:

With A1 AS (select * from one_week2 where day='2020-04-05'),
A2 AS (select * from one_week2 where day='2020-04-05'),
A3 AS (select * from one_week2 where day='2020-04-05'),
A4 AS (select * from one_week2 where day='2020-04-05'),
A5 AS (select * from one_week2 where day='2020-04-05'),
final_merged_table as (
    select * from a1 
    union all select * from a2
    union all select * from a3
    union all select * from a4
    union all select * from a5)

SELECT count(*)
FROM final_merged_table

Option 2:

With A1 AS (select * from one_week2),
A2 AS (select * from one_week2),
A3 AS (select * from one_week2),
A4 AS (select * from one_week2),
A5 AS (select * from one_week2),
final_merged_table as (
    select * from a1 
    union all select * from a2
    union all select * from a3
    union all select * from a4
    union all select * from a5)

SELECT count(*)
FROM final_merged_table
where day='2020-04-05'
;

When we run these queries, the profile for both looks identical - as the filter has been pushed down:

Option 1 profile

Option 2 profile

In summary

You can trust the Snowflake optimizer.

Trust is important, but also verify: Sometimes the optimizer can get confused by convoluted CTEs. Sometimes the Snowflake engs optimize the optimizer, and what doesn't work today, can work better tomorrow.