0

I am trying to query a table which has 1 TB of data, clustered by Date and Company. A simple query is taking a long time.

Posting the query and query profile

SELECT
    sl.customer_code,
    qt_product_category_l3_sid,
    qt_product_brand_sid,
    sl.partner_code,
    sl.transaction_id,
    dollars_spent,
    units,
    user_pii_sid,
    promo_flag,
    media_flag
FROM 
    cdw_dwh.public.qi_sg_promo_media_sales_lines_fact sl
WHERE  
    transaction_date_id >= (to_char(current_date - (52*7) , 'yyyymmdd')  ) 
    AND sl.partner_code IN ('All Retailers')
    AND qt_product_category_l3_sid IN (SELECT DISTINCT qt_product_category_l3_sid 
                                       FROM cdw_dwh.PUBLIC.qi_sg_prompt_category_major_brand 
                                       WHERE qt_product_category_l1_sid IN (246))
    AND qt_product_brand_sid IN (SELECT qt_product_brand_sid
                                 FROM cdw_dwh.PUBLIC.qi_sg_prompt_category_major_brand
                                 WHERE qt_product_major_brand_sid IN (246903, 430138))

[Query profile screenshot]

    Take a look at the query profile to see where the time is being spent (remote IO vs processing) etc, then if you have a question like "Why is there so much remote IO for this query" then maybe we can help. – Nat Taylor Nov 05 '19 at 15:39
  • I saw the query profile; 99% of the execution time is spent on the Table Scan. – Thiyagaraj Narayanan Nov 06 '19 at 09:33
  • Is your "simple" query leveraging the clustered columns as a filter? Unless you are filtering on Date, you aren't really pruning effectively, and Snowflake will need to scan all of your partitions to get a result. – Mike Walton Nov 06 '19 at 13:48
  • Posted the query and query plan in the question. Please suggest an idea to improve performance. – Thiyagaraj Narayanan Nov 07 '19 at 09:09
  • We are using XL warehouse by the way – Thiyagaraj Narayanan Nov 07 '19 at 09:10
  • Not sure I would call two nested sub-selects in a WHERE clause simple. Can you please alias the two instances of cdw_dwh.PUBLIC.qi_sg_prompt_category_major_brand and then use the correct aliases in all column references? At a "simple" reading it would seem you don't need the second sub-select, but I am not sure where you are getting each column from, so I cannot simplify. Also, 38M rows is not many. – Simeon Pilgrim Nov 07 '19 at 20:00

5 Answers

0

"simple query" I am not sure there is such a thing. A naive query, sure.

select * from really_large_table where column1 = value;

will perform really badly if you only care about 1 or 2 of the columns, because Snowflake has to load all the column data. You will get a column-data-to-row-data ratio improvement by using

select column1, column2 from really_large_table where column1 = value;

Now only two columns of data need to be read from the data store.

Maybe you are looking for data where the value is > 100 because you think that should not happen. Then

select column1, column2 from really_large_table where column1 > 100 limit 1;

will perform much better than

select column1, column2 from really_large_table order by column1 desc limit 50;

But if the query is already doing the minimum work it can to produce a correct answer, your next option is to increase the warehouse size, which for IO-bound work gives a roughly linear improvement, although some aggregation steps don't scale as linearly.
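
For example (a sketch only; my_wh is a placeholder warehouse name, and since the comments mention already running on XL, the next step up would be 2XL):

    -- Sketch: move an IO-bound workload to the next warehouse size up.
    -- "my_wh" is a placeholder; the question mentions already running on XL.
    ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'XXLARGE';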

Another thing to look for is that sometimes your calculation can produce too much intermediate state, and it spills to local or remote storage (the query profile reports this as "bytes spilled"), which is much like running out of RAM and going to swap on disk.
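
If you want to check for spilling across recent runs rather than in one profile, here is a sketch against the SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY view (the filter and LIMIT values are arbitrary):

    -- Sketch: list recent queries that spilled intermediate state to local or remote storage.
    SELECT query_id,
           total_elapsed_time,
           bytes_spilled_to_local_storage,
           bytes_spilled_to_remote_storage
    FROM snowflake.account_usage.query_history
    WHERE bytes_spilled_to_local_storage > 0
       OR bytes_spilled_to_remote_storage > 0
    ORDER BY start_time DESC
    LIMIT 20;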

We have also seen memory pressure when doing too much work in a JavaScript UDF, which slowed things down.

But most of these can be spotted by looking at the query profile and looking at the hotspots.

Simeon Pilgrim
  • Hi, I am selecting only 4 columns from the table, and I am querying the last 1 year, which is scanning half of all the partitions. – Thiyagaraj Narayanan Nov 06 '19 at 09:34
  • Well, it is good that you're only scanning half the table. The next thing to check is the row drop rate after the table scan. If you are pulling n million rows and then a filter drops most of them, your clustering might benefit from being different. If there are no row drops, you are just processing lots of data and should get a bigger instance; they are rather linear for simple stuff. Maybe post a picture of the query plan. – Simeon Pilgrim Nov 06 '19 at 19:02
  • Posted the query and query plan in the question. Please suggest an idea to improve performance. – Thiyagaraj Narayanan Nov 07 '19 at 09:09
0

99% of the time was spent scanning the table. The filters within the query do not match your clustering keys, so the clustering won't help much. Depending on how much historical data you have in this table, and whether you will continue to read a year's worth of data, you might be better off clustering by qt_product_brand_sid or qt_product_category_l3_sid (or creating a materialized view clustered that way), depending on which one filters the data down faster.
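
A rough sketch of that idea, reusing the column names from the question (the view name and the exact cluster key are illustrative only, and materialized views require Enterprise Edition):

    -- Sketch: a materialized view clustered on the brand key, so brand/category
    -- filters can prune micro-partitions instead of scanning the Date/Company layout.
    CREATE MATERIALIZED VIEW cdw_dwh.public.sales_lines_by_brand_mv
      CLUSTER BY (qt_product_brand_sid, transaction_date_id)
    AS
    SELECT customer_code, qt_product_category_l3_sid, qt_product_brand_sid,
           partner_code, transaction_id, dollars_spent, units,
           user_pii_sid, promo_flag, media_flag, transaction_date_id
    FROM cdw_dwh.public.qi_sg_promo_media_sales_lines_fact;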

HIlda Davies
0

A big change would be converting the transaction date from a varchar to a true date field.
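
A sketch of what that could look like, assuming the yyyymmdd varchar format used in the question; the new column name is made up, and on a 1 TB table you would more likely rebuild the table with a CTAS than run a full UPDATE:

    -- Sketch: add a native DATE column derived from the yyyymmdd varchar key.
    ALTER TABLE cdw_dwh.public.qi_sg_promo_media_sales_lines_fact
        ADD COLUMN transaction_date DATE;

    -- Backfill; TO_DATE parses the existing varchar values.
    UPDATE cdw_dwh.public.qi_sg_promo_media_sales_lines_fact
        SET transaction_date = TO_DATE(transaction_date_id, 'YYYYMMDD');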

Second, you have an IN clause with a single value; use = instead. For the other IN clauses, I would suggest rewriting the query to separate out those sub-queries as CTEs and then joining to those CTEs.
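
A sketch of that rewrite, assuming the table and column names from the question (the CTE names are illustrative):

    WITH cat AS (
        -- categories under the requested level-1 category
        SELECT DISTINCT qt_product_category_l3_sid
        FROM cdw_dwh.public.qi_sg_prompt_category_major_brand
        WHERE qt_product_category_l1_sid = 246
    ), brand AS (
        -- brands under the requested major brands
        SELECT DISTINCT qt_product_brand_sid
        FROM cdw_dwh.public.qi_sg_prompt_category_major_brand
        WHERE qt_product_major_brand_sid IN (246903, 430138)
    )
    SELECT
        sl.customer_code,
        sl.qt_product_category_l3_sid,
        sl.qt_product_brand_sid,
        sl.partner_code,
        sl.transaction_id,
        sl.dollars_spent,
        sl.units,
        sl.user_pii_sid,
        sl.promo_flag,
        sl.media_flag
    FROM cdw_dwh.public.qi_sg_promo_media_sales_lines_fact sl
    JOIN cat   ON sl.qt_product_category_l3_sid = cat.qt_product_category_l3_sid
    JOIN brand ON sl.qt_product_brand_sid = brand.qt_product_brand_sid
    WHERE sl.transaction_date_id >= to_char(current_date - 52 * 7, 'yyyymmdd')
      AND sl.partner_code = 'All Retailers';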

Erick Roesch
0

Use this query:

    SELECT
        sl.customer_code,
        sl.qt_product_category_l3_sid,
        sl.qt_product_brand_sid,
        sl.partner_code,
        sl.transaction_id,
        sl.dollars_spent,
        sl.units,
        sl.user_pii_sid,
        sl.promo_flag,
        sl.media_flag
    FROM 
        cdw_dwh.public.qi_sg_promo_media_sales_lines_fact sl,
        cdw_dwh.PUBLIC.qi_sg_prompt_category_major_brand prod_cat,
        cdw_dwh.PUBLIC.qi_sg_prompt_category_major_brand prod_brand
    WHERE  
        sl.transaction_date_id >= (to_char(current_date - (52*7), 'yyyymmdd'))
        AND sl.partner_code IN ('All Retailers')
        AND sl.qt_product_category_l3_sid = prod_cat.qt_product_category_l3_sid
        AND prod_cat.qt_product_category_l1_sid = 246
        AND sl.qt_product_brand_sid = prod_brand.qt_product_brand_sid
        AND prod_brand.qt_product_major_brand_sid IN (246903, 430138)
MKP
0

Apparently performance is an area of focus for Snowflake R&D. After struggling to make complex queries perform well on big data, we got 100x improvements with Exasol, with no tuning whatsoever.

  • Could you be specific on how complex queries helped? – Yichun Apr 13 '20 at 00:15
  • I'm not sure what you mean. Snowflake handled simple queries (~10 nodes in the query plan) quite well, even on huge data. However, performance degraded quickly as complexity increased. We had over 100 nodes in the query plan (Tableau generated). Snowflake support basically shrugged. One day they will have indexes and a faster, more efficient query planner. – Anthony Krinsky Apr 14 '20 at 19:05
  • I mean if you can compare the "complex queries" that gave 100x improvements and your baseline queries, it would constitute a great answer. – Yichun Apr 14 '20 at 20:42