0

I have a (heavily simplified) orders table, total being the dollar amount, containing:

| id | client_id |   type | total |
|----|-----------|--------|-------|
|  1 |         1 |   sale |   100 |
|  2 |         1 | refund |   100 |
|  3 |         1 | refund |   100 |

And clients table containing:

| id | name |
|----|------|
|  1 | test |

I am attempting to create a breakdown, by client, metrics about the total number of sales, refunds, sum of sales, sum of refunds etc.

To do this, I am querying the clients table and joining the orders table. The orders table contains both sales and refunds, specified by the type column.

My idea was to join the orders twice using subqueries and create aliases for those filtered tables. The aliases would then be used in aggregate functions to find the sum, average etc. I have tried many variations of joining the orders table twice to achieve this but it produces the same incorrect results. This query demonstrates this idea:

SELECT
  clients.*,
  SUM(sales.total) as total_sales,
  SUM(refunds.total) as total_refunds,
  AVG(sales.total) as avg_ticket,
  COUNT(sales.*) as num_of_sales
FROM clients   
LEFT JOIN (SELECT * FROM orders WHERE type = 'sale') as sales
  ON sales.client_id = clients.id   
LEFT JOIN (SELECT * FROM orders WHERE type = 'refund') as refunds
  ON refunds.client_id = clients.id  
GROUP BY clients.id

Result:

| id | name | total_sales | total_refunds | avg_ticket | num_of_sales |
|----|------|-------------|---------------|------------|--------------|
|  1 | test |         200 |           200 |        100 |            2 |

Expected result:

| id | name | total_sales | total_refunds | avg_ticket | num_of_sales |
|----|------|-------------|---------------|------------|--------------|
|  1 | test |         100 |           200 |        100 |            1 |

When the second join is included in the query, the rows returned from the first join are returned again with the second join. They are multiplied by the number of rows in the second join. It's clear my understanding of joining and/or subqueries is incomplete.

I understand that I can filter the orders table with each aggregate function. This produces correct results but seems inefficient:

SELECT
  clients.*,
  SUM(orders.total) FILTER (WHERE type = 'sale') as total_sales,
  SUM(orders.total) FILTER (WHERE type = 'refund') as total_refunds,
  AVG(orders.total) FILTER (WHERE type = 'sale') as avg_ticket,
  COUNT(orders.*) FILTER (WHERE type = 'sale') as num_of_sales      
FROM clients    
LEFT JOIN orders 
  on orders.client_id = clients.id     
GROUP BY clients.id 

What is the appropriate way to created filtered and aliased versions of this joined table?

Also, what exactly is happening with my initial query where the two subqueries are joined. I would expect them to be treated as separate subsets even though they are operating on the same (orders) table.

Rocky
  • 132
  • 1
  • 2
  • 9
  • 1
    When you get a result that you don't expect/understand, stop trying to find your overall goal & find out what your misunderstanding is.--Isolate the first unexpected/misunderstood subexpression & its input & output & learn what misconception, typo, wrong reasoning, etc led to it. (Debugging fundamental.) Ask about that. PS You don't clearly say or illustrate what you want or the problem, how are we to know? (Rhetorical.) When giving a business relation(ship)/association or table (base or query result), say what a row in it states about the business situation in terms of its column values. – philipxy Mar 05 '20 at 06:41
  • This seems likely to be a common error where people want some joins, each possibly involving a different key, of some subqueries, each possibly involving join and/or aggregation, but they erroneously try to do all the joining then all the aggregating or to aggregate over previous aggregations. [Strange duplicate behavior from GROUP_CONCAT of two LEFT JOINs of GROUP_BYs](https://stackoverflow.com/a/45252750/3404097) PS Please put all of a [mre] in your post itself (as text) not just at a fiddle. Also what parts are you able to do? – philipxy Mar 05 '20 at 06:47
  • Please see revised question. The question is written as an example based on a more complicated use case. I've tried to simplify it the best I can. I believe the core question of "where in the query can you filter and alias a joined table multiple times" is valid. – Rocky Mar 05 '20 at 07:59
  • Read & act on my 2nd comment. You don't want to calculate the 2nd left join; it's not on a key of either of its tables (the 1st join output or refunds) so you get multiple rows per user.--As I explained/linked. Calculate all your aggregations correctly separately then look at the queries & output to see how instead some can be combined to aggregate at the same time; then see that you want the resulting aggregations joined on user id. Use data with at least 2 groups. Read your post.--You still don't explain how output is to be a function of input. Act on all my 1st comment. – philipxy Mar 05 '20 at 08:26

1 Answers1

1

You should do the (filtered) aggregation once for all aggregates you want, and then join to the result of that. As your aggregation doesn't need any columns from the clients table, this can be done in a derived table. This is also typically faster than grouping the result of the join.

SELECT clients.*,
       o.total_sales, 
       o.total_refunds,
       o.avg_ticket,
       o.num_of_sales
FROM clients    
  LEFT JOIN (
    select client_id, 
           SUM(total) FILTER (WHERE type = 'sale') as total_sales,
           SUM(total) FILTER (WHERE type = 'refund') as total_refunds,
           AVG(total) FILTER (WHERE type = 'sale') as avg_ticket,
           COUNT(*) FILTER (WHERE type = 'sale') as num_of_sales      
    from orders
    group by client_id
  ) o on o.client_id = clients.id     
  • So the aggregation runs once when joining, then gets grouped by client id. I like it. What about the problem of having to filter the orders table in the same way multiple times? In the real use case there are many more aggregate functions being returned and many more columns. – Rocky Mar 05 '20 at 08:11
  • @Rocky: it's typically more efficient to do everything at once (= single scan over the table) rather than selecting from the same table multiple times. But obviously you could use that approach to create multiple derived tables each with a different `WHERE` clause instead of using `filter()` then compare the execution plans using `explain (analyze, buffers)` –  Mar 05 '20 at 08:12
  • I see. So the key for my solution was to aggregate first, then join. Using your answer with two derived tables and ````WHERE```` was the naive solution I started with. I will compare to see which method is more performant. – Rocky Mar 05 '20 at 08:26
  • @Rocky: the solution you started with, still did the aggregation in the "outer" select, not inside the derived tables. –  Mar 05 '20 at 08:27
  • Yes that is correct. I should have said my incorrect naive solution that i started with. I was referencing joining the derived table twice. – Rocky Mar 05 '20 at 08:36