
I have a dataset of purchases with customer, product, category and sales value.

customer     product     category    sales_value
       A     aerosol     air_care             10
       B     aerosol     air_care             12
       C     aerosol     air_care              7
       A     perfume     air_care              8
       A     perfume     air_care              2
       D     perfume     air_care             11
       C      burger         food             13
       D       fries         food              6
       C       fries         food              9
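
For reference, the sample data can be loaded with something like the following (table and column types are assumptions; note the question's query calls the table `transactions`, while the second answer calls it `purchases`):

```sql
-- Hypothetical setup matching the sample above; types are assumed
create table transactions (
  customer     text,
  product      text,
  category     text,
  sales_value  numeric
);

insert into transactions (customer, product, category, sales_value) values
  ('A', 'aerosol', 'air_care', 10),
  ('B', 'aerosol', 'air_care', 12),
  ('C', 'aerosol', 'air_care',  7),
  ('A', 'perfume', 'air_care',  8),
  ('A', 'perfume', 'air_care',  2),
  ('D', 'perfume', 'air_care', 11),
  ('C', 'burger',  'food',     13),
  ('D', 'fries',   'food',      6),
  ('C', 'fries',   'food',      9);
```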

I want, for each product, the ratio between the sales value spent on this product, and the sales value spent on this product's category, by the customers who bought the product at least once.

Another way to say it: take customers who bought fries at least once, and for all of them, compute A) the sum of sales value spent on fries and B) the sum of sales value spent on food.

An intermediate table would be of the following form:

product    category    sum_spent_on_product    sum_spent_on_category_by_product_buyers    ratio
aerosol    air_care                      29                                         39     0.74
perfume    air_care                      21                                         31     0.68
burger     food                          13                                         22     0.59
fries      food                          15                                         28     0.54

Example: people having bought aerosol at least once spent a total of 29 on this product. The same people, overall, spent 39 on the air_care category (which aerosol belongs to). Thus, the ratio for aerosol is 29 / 39 ≈ 0.74.

I've tried to solve this using a left join lateral to compute the given intermediate results for each product, but I can't wrap my head around how to add the condition restricting the sums to customers who bought this specific product:

select distinct
    t.product
  , t.category
  , c.sales_category
from transactions t
left join lateral (
  select
    sum(sales_value) as sales_category
  from transactions
  where category = t.category
  group by category
) c on true
;

The above query lists, for each product, the total spent on the product's category, but without the required product-buyer condition.

Is left join lateral the right way to go? Is there any other solution in plain SQL?

Jivan

2 Answers


I would go for a windowing function to compute the total spent by each customer in each category:

SELECT
  customer, product, category, sales_value,
  sum(sales_value) OVER (PARTITION BY customer, category) AS tot_cat
FROM transactions;

 customer | product | category | sales_value | tot_cat 
----------+---------+----------+-------------+---------
 A        | aerosol | air_care |       10.00 |   20.00
 A        | perfume | air_care |        8.00 |   20.00
 A        | perfume | air_care |        2.00 |   20.00
 B        | aerosol | air_care |       12.00 |   12.00
 C        | aerosol | air_care |        7.00 |    7.00
 C        | fries   | food     |        9.00 |   22.00
 C        | burger  | food     |       13.00 |   22.00
 D        | perfume | air_care |       11.00 |   11.00
 D        | fries   | food     |        6.00 |    6.00

Then we just have to sum up. There is just a problem when a customer has bought the same product several times. In your example, customer A has bought perfume twice. To overcome this problem let's group by customer, product and category at the same time (and sum the sales_value column):

SELECT
  customer, product, category, SUM(sales_value) AS sales_value,
  SUM(SUM(sales_value)) OVER (PARTITION BY customer, category) AS tot_cat
FROM transactions
GROUP BY customer, product, category

 customer | product | category | sales_value | tot_cat 
----------+---------+----------+-------------+---------
 A        | aerosol | air_care |       10.00 |   20.00
 A        | perfume | air_care |       10.00 |   20.00 <-- this row summarizes rows 2 and 3 of previous result
 B        | aerosol | air_care |       12.00 |   12.00
 C        | aerosol | air_care |        7.00 |    7.00
 C        | burger  | food     |       13.00 |   22.00
 C        | fries   | food     |        9.00 |   22.00
 D        | perfume | air_care |       11.00 |   11.00
 D        | fries   | food     |        6.00 |    6.00

Now we just have to sum sales_value and tot_cat to get your intermediate result table. I use a common table expression to get the previous result under the name t:

WITH t AS (
  SELECT
    customer, product, category, SUM(sales_value) AS sales_value,
    SUM(SUM(sales_value)) OVER (PARTITION BY customer, category) AS tot_cat
  FROM transactions
  GROUP BY customer, product, category
)
SELECT
  product, category,
  sum(sales_value) AS sales_value, sum(tot_cat) AS tot_cat,
  sum(sales_value) / sum(tot_cat) AS ratio
FROM t
GROUP BY product, category;

 product | category | sales_value | tot_cat |         ratio          
---------+----------+-------------+---------+------------------------
 aerosol | air_care |       29.00 |   39.00 | 0.74358974358974358974
 fries   | food     |       15.00 |   28.00 | 0.53571428571428571429
 burger  | food     |       13.00 |   22.00 | 0.59090909090909090909
 perfume | air_care |       21.00 |   31.00 | 0.67741935483870967742
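
For comparison, the buyer restriction the question asks about can also be expressed inside the lateral subquery itself, by filtering the inner scan to customers who bought the current product. This is an untested sketch assuming Postgres (which supports the aggregate `FILTER` clause) and a numeric `sales_value`; as the comments note, the window-function approach above is likely to be much faster:

```sql
select distinct
    t.product
  , t.category
  , c.sales_product
  , c.sales_category
  , round(c.sales_product / c.sales_category, 2) as ratio
from transactions t
left join lateral (
  select
      -- total spent on this product by its buyers
      sum(sales_value) filter (where product = t.product) as sales_product
      -- total spent on the whole category by the same buyers
    , sum(sales_value) as sales_category
  from transactions
  where category = t.category
    and customer in (select customer from transactions where product = t.product)
) c on true;
```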
Fabian Pijcke
  • Your solution is actually a million times faster than a `lateral join` solution, especially if there are multiple complex `where` clauses along the way. – Jivan Mar 05 '19 at 01:33
  • Nice, I had no clue about the performance of this solution. Thank you for your feedback! – Fabian Pijcke Mar 06 '19 at 07:08

I want, for each product, the ratio between the sales value spent on this product, and the sales value spent on this product's category, by the customers who bought the product at least once.

If I understand this correctly, you can summarize the sales by person and category to get the total for the category. In Postgres, you can keep an array of the products and use that for matching. So, the query looks like:

select p.product, p.category,
       sum(p.sales_value) as product_only_sales, 
       sum(pp.sales_value) as comparable_sales
from purchases p join
     (select customer, category, array_agg(distinct product) as products, sum(sales_value) as sales_value
      from purchases p
      group by customer, category
     ) pp
     on p.customer = pp.customer and p.category = pp.category and p.product = any (pp.products)
group by p.product, p.category;

Here is a db<>fiddle.

EDIT:

The data allows duplicate rows for a customer and a single product. That throws things off. The solution is to pre-aggregate by product for each customer:

select p.product, p.category, sum(p.sales_value) as product_only_sales, sum(pp.sales_value) as comparable_sales
from (select customer, category, product, sum(sales_value) as sales_value
      from purchases p
      group by customer, category, product
     ) p join
     (select customer, category, array_agg(distinct product) as products, sum(sales_value) as sales_value
      from purchases p
      group by customer, category
     ) pp
     on p.customer = pp.customer and p.category = pp.category and p.product = any (pp.products)
group by p.product, p.category

Here is the db<>fiddle for this example.
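
Incidentally, once the first subquery pre-aggregates per (customer, category, product), every `p.product` necessarily appears in `pp.products` for the matching customer and category, so the array condition looks redundant and the join could arguably be simplified (untested sketch):

```sql
-- Same shape as the query above, minus the array_agg matching,
-- which the per-product pre-aggregation appears to make redundant
select p.product, p.category,
       sum(p.sales_value) as product_only_sales,
       sum(pp.sales_value) as comparable_sales
from (select customer, category, product, sum(sales_value) as sales_value
      from purchases
      group by customer, category, product
     ) p join
     (select customer, category, sum(sales_value) as sales_value
      from purchases
      group by customer, category
     ) pp
     on p.customer = pp.customer and p.category = pp.category
group by p.product, p.category;
```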

Gordon Linoff
  • This is really interesting. However, with real data (approx. 2.5M rows) it's taking around 100 seconds to compute, even if we focus on a small subset of products. I wonder if there's a way to aggregate pp not by customer->products but the other way around in order to improve performance; the order of this aggregation should probably be determined by the nature of the filter we want (as in, do we want this for a particular set of products, or for a particular set of customers). Thanks anyway, this idea is very helpful. – Jivan Feb 24 '19 at 13:53
  • @Jivan . . . That seems like a long time. How long does the subquery with the array take? An index on `customer, category, product` might prove quite helpful. – Gordon Linoff Feb 24 '19 at 14:09
  • @unutbu . . . Good catch. That is caused by duplicates in the table for a customer and a single product. – Gordon Linoff Feb 24 '19 at 20:40