1

I am a SQL beginner. I have trouble on how to find the top 3 max values in each category. The question was

"For order_ids in January 2006, what were the top (by revenue) 3 product_ids for each category_id? "

Table A:                    
(Column name)         
customer_id            
order_id              
order_date   
revenue  
product_id

Table B:  
product_id  
category_id

I tried to combine table B and A using an Inner Join and filtered by the order_date. But then I am stuck on how to find the top 3 max values in each category_id. Thanks.

This is so far what I can think of

SELECT B.product_id, category_id FROM A

JOIN B ON B.product_id = A.product_id

WHERE order_date BETWEEN ‘2006-01-01’ AND ‘2006-01-31’

ORDER BY revenue DESC

LIMIT 3;
Ricky C
  • 33
  • 5
  • Welcome to SO. Try to format the sql query as code - this improves readability. – Steffen Moritz Jun 18 '19 at 00:51
  • The problem with limit is that it places the limit on the immediate or closest table set and in the query above that is coming from the Product table. The query above will only pull 3 records from the product table. – Ross Bush Jun 18 '19 at 01:02

4 Answers4

4

This kind of query is typically solved using window functions

select *
from (
  SELECT b.product_id, 
         b.category_id,
         a.revenue,
         dense_rank() over (partition by b.category_id, b.product_id order by a.revenue desc) as rnk
  from A
    join b ON B.product_id = A.product_id
  where a.order_date between date '2006-01-01' AND date '2006-01-31'
) as t
where rnk <= 3
order by product_id, category_id, revenue desc;

dense_rank() will also deal with ties (products with the same revenue in the same category) so you might actually get more than 3 rows per product/category.

If the same product can show up more than once in table b (for the same category) you need to combine this with a GROUP BY to get the sum of all revenues:

select *
from (
  SELECT b.product_id, 
         b.category_id,
         sum(a.revenue) as total_revenue,
         dense_rank() over (partition by b.category_id, a.product_id order by sum(a.revenue) desc) as rnk
  from a
    join b on B.product_id = A.product_id
  where a.order_date between date '2006-01-01' AND date '2006-01-31'
  group by b.product_id, b.category_id
) as t
where rnk <= 3
order by product_id, category_id, total_revenue desc;

When combining window functions and GROUP BY, the window function will be applied after the GROUP BY.

  • Could you tell me what does the t mean in the third last of the line? Thanks – Ricky C Jun 18 '19 at 03:07
  • @RickyC: that's an alias for the sub-query (derived table). See the manual for details: https://www.postgresql.org/docs/current/queries-table-expressions.html#QUERIES-TABLE-ALIASES –  Jun 18 '19 at 03:07
1

You can use window functions to gather the grouped revenue and then pull the last X in the outer query. I have not worked in PostgreSQL in a bit so I may be missing a shortcut function below.

WITH ByRevenue AS
(
    --This creates a virtualized table that can be queried similar to a physical table in the conjoined statements below 
    SELECT
        category_id,
        product_id,
        MAX(revenue) as max_revenue 
    FROM 
        A
        JOIN B ON B.product_id = A.product_id
    WHERE 
        order_date BETWEEN ‘2018-01-01’ AND ‘2018-01-31’
    GROUP BY
        category_id,product_id
)
,Normalized
(
    --Pull data from the in memory table above using normal sql syntax and normalize it with a RANK function to achieve the limit.
    SELECT
        category_id,
        product_id,
        max_revenue,
        ROW_NUMBER() OVER (PARTITION BY category_id,product_id ORDER BY max_revenue DESC) as rn
    FROM
        ByRevenue
)
--Final query from stuff above with each category/product ranked by revenue
SELECT * 
FROM Normalized 
WHERE RN<=3;
Ross Bush
  • 14,648
  • 2
  • 32
  • 55
  • I added the product_id to the query chain as I saw that was a requirement after I answered. – Ross Bush Jun 18 '19 at 01:21
  • you don't actually need two CTEs - you can combine GROUP BY and window functions in a single query. –  Jun 18 '19 at 03:06
  • Doh! DENSE_RANK? – Ross Bush Jun 18 '19 at 04:05
  • I am going to say, accept the answer from @a_horse_with_no_name. It is too late for me to rewrite this using the dense_rank() window, also, he had it right from the get go :/ – Ross Bush Jun 18 '19 at 04:10
0

Try using Fetch n rows only?

Note: Let's think that your primary key here is product_id, so I used them for combining the two table.

SELECT A.category,A.revenue From Table A 
INNER JOIN Table B on A.product_id = B.Product_ID 
WHERE A.Order_Date between (from date) and (to date)
ORDER BY A.Revenue DESC
Fetch first 3 rows only
piet.t
  • 11,718
  • 21
  • 43
  • 52
  • This will not adhere to the "each category" bit in the question. – Ross Bush Jun 18 '19 at 01:03
  • Thanks for the answer. But that doesn't give me the corresponding category. And somehow the TOP function doesn't work in my PostgreSQL. It keeps saying syntax error. :( – Ricky C Jun 18 '19 at 01:13
0

For top-n queries, the first thing to try is usually the lateral join:

WITH categories as (
    SELECT DISTINCT category_id
    FROM B
)
SELECT categories.category_id, sub.product_id
FROM categories
JOIN LATERAL (
    SELECT a.product_id
    FROM B
    JOIN A ON (a.product_id = b.product_id)
    WHERE b.category_id = categories.category_id
      AND order_date BETWEEN '2006-01-01' AND '2006-01-31'
    GROUP BY a.product_id
    ORDER BY sum(revenue) desc
    LIMIT 3
) sub on true;
Jeremy
  • 6,313
  • 17
  • 20