15

I have the following tables:

customers

customer_id name
----------------
1           bob
2           alice
3           tim

purchases

id customer_id item_bought
--------------------------
1  1           hat
2  1           shoes
3  2           glasses
3  2           glasses
4  2           book
5  3           shoes
6  1           hat

And I want to following result:

customer_name item_bought_most_often
------------------------------------
bob           hat
alice         glasses
tim           shoes

I would do this like this (not actually tried, just the idea):

SELECT customer.name as customer_name,
  MODE(item_bought) as item_bought_most_ofen
FROM customers
INNER JOIN purchases USING (customer_id)
GROUP_BY customer_id

However, the MODE aggregation function doesn't exist in Redshift.

It seems that Redshift user defined functions are just regular scalar functions, not aggregations functions. So I don't think I can define it myself.

Any workaround ?

Benjamin Crouzier
  • 40,265
  • 44
  • 171
  • 236

2 Answers2

10

You can mimic mode() by using row_number():

select name, item_bought
from (select c.name, p.item_bought, count(*) as cnt,
             row_number() over (partition by c.name order by count(*) desc) as seqnum
      from customers c join
           purchases p
           using (customer_id)
      group by c.name, p.item_bought
     ) cp
where seqnum = 1;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
8

You could first COUNT each person purchases and then use RANK() windowed function:

SELECT name AS customer_name, item_bought AS item_bought_most_often
FROM(SELECT name,item_bought,RANK() OVER(PARTITION BY name ORDER BY cnt DESC) rnk
     FROM (SELECT c.name, p.item_bought, COUNT(*) AS cnt
           FROM customers c
           JOIN purchases p
             ON p.customer_id = c.customer_id
           GROUP BY c.name, p.item_bought) AS s1) AS s2
WHERE rnk = 1;

LiveDemo

Output:

╔═══════════════╦════════════════════════╗
║ customer_name ║ item_bought_most_often ║
╠═══════════════╬════════════════════════╣
║ alice         ║ glasses                ║
║ bob           ║ hat                    ║
║ tim           ║ shoes                  ║
║ zoe           ║ pencil                 ║
║ zoe           ║ book                   ║
╚═══════════════╩════════════════════════╝

Note:

RANK will handle multiple most common values.


EDIT:

As Lukas Eder mentioned it could be further simplified:

SELECT name AS customer_name, item_bought AS  item_bought_most_often
FROM(SELECT name,item_bought,
            RANK() OVER(PARTITION BY name ORDER BY COUNT(*) DESC) rnk
     FROM customers c
     JOIN purchases p
       ON p.customer_id = c.customer_id
     GROUP BY c.name, p.item_bought)  AS s2
WHERE rnk = 1;

db<>fiddle demo

Lukasz Szozda
  • 162,964
  • 23
  • 234
  • 275
  • I was doing something similar. I was really hoping for an aggregation function like `FIRST(my_column), MODE(my_column)`, or the ability to define it. But it doesn't exist. Another possibility would be something like `SPLIT_PART(LISTAGG(id, ','),',',1)`. or `udf_mode(LISTAGG, id, ',')`. udf_mode being a user defined function that calculates the mode based on string of values separated by a comma. But those are all hacky. – Benjamin Crouzier Apr 27 '16 at 12:35
  • 1
    @pinouchon Based of [doc](http://docs.aws.amazon.com/redshift/latest/dg/user-defined-functions.html) *`You can create a custom user-defined >>scalar<< function (UDF)`*. I do not see user-defined aggregate functions doc like in Postgresql [CREATE AGGREGATE](http://www.postgresql.org/docs/current/static/sql-createaggregate.html) Idea with using `LISTAGG` and then using udf_mode could work. – Lukasz Szozda Apr 27 '16 at 12:44
  • 1
    There's no need for the second level derived table, you can `ORDER BY COUNT(*) DESC` directly – Lukas Eder Jun 03 '21 at 14:43
  • @LukasEder Thank you for your comment. I have extended the answer to include that version too. – Lukasz Szozda Jun 03 '21 at 19:27