TPC-DS Query 6: Why do we need 'where j.i_category = i.i_category' condition?

Question

I'm going through TPC-DS for Amazon Athena.

It was fine until query 5.

I got some problem on query 6. (which is below)

select  a.ca_state state, count(*) cnt
 from customer_address a
     ,customer c
     ,store_sales s
     ,date_dim d
     ,item i
 where       a.ca_address_sk = c.c_current_addr_sk
    and c.c_customer_sk = s.ss_customer_sk
    and s.ss_sold_date_sk = d.d_date_sk
    and s.ss_item_sk = i.i_item_sk
    and d.d_month_seq = 
         (select distinct (d_month_seq)
          from date_dim
               where d_year = 2002
            and d_moy = 3 )
    and i.i_current_price > 1.2 * 
             (select avg(j.i_current_price) 
         from item j 
         where j.i_category = i.i_category)
 group by a.ca_state
 having count(*) >= 10
 order by cnt, a.ca_state 
 limit 100;

It took more than 30 minutes so it failed with timeout.

I tried to find which part cause problem, so I checked the where conditions and I found where j.i_category = i.i_category for the last part of where condition.

I don't know why this condition is needed so I deleted this part and the query ran Ok.

can you guys tell me why this part is needed?

It's the difference between calculating the average price within a single category and calculating the average price across all categories - a pretty big difference. — Damien_The_Unbeliever, Dec 15 '20 at 08:47
Tip of today: Switch to modern, explicit `JOIN` syntax. Easier to write (without errors), easier to read and maintain, and easier to convert to outer join if needed! — jarlh, Dec 15 '20 at 08:52
`DISTINCT` is not a function. Skip those extra parentheses and simply write `(select distinct d_month_seq from ...` to make code clearer! — jarlh, Dec 15 '20 at 08:53

score 0 · Answer 1 · answered Dec 15 '20 at 13:24

The j.i_category = i.i_category is subquery correlation condition. If you remove it from the subquery

select avg(j.i_current_price) 
from item j 
where j.i_category = i.i_category)

the subquery becomes uncorrelated, and becomes a global aggregation on the item table, which is easy to calculate and the query engine needs to do it once.

If you want a fast, performant query engine on AWS, i can recommend Starburst Presto (disclaimer: i am from Starburst). See https://www.concurrencylabs.com/blog/starburst-presto-vs-aws-redshift/ for a related comparison (note: this is not a comparison with Athena).

If it doesn't have to be that fast, you can use PrestoSQL on EMR (note that "PrestoSQL" and "Presto" components on EMR are not the same thing).

TPC-DS Query 6: Why do we need 'where j.i_category = i.i_category' condition?

1 Answers1