MODE aggregation function with redshift

Question

I have the following tables:

customers

customer_id name
----------------
1           bob
2           alice
3           tim

purchases

id customer_id item_bought
--------------------------
1  1           hat
2  1           shoes
3  2           glasses
3  2           glasses
4  2           book
5  3           shoes
6  1           hat

And I want to following result:

customer_name item_bought_most_often
------------------------------------
bob           hat
alice         glasses
tim           shoes

I would do this like this (not actually tried, just the idea):

SELECT customer.name as customer_name,
  MODE(item_bought) as item_bought_most_ofen
FROM customers
INNER JOIN purchases USING (customer_id)
GROUP_BY customer_id

However, the MODE aggregation function doesn't exist in Redshift.

It seems that Redshift user defined functions are just regular scalar functions, not aggregations functions. So I don't think I can define it myself.

Any workaround ?

Gordon Linoff · Answer 1 · 2020-02-27T02:12:28.610

10

You can mimic mode() by using row_number():

select name, item_bought
from (select c.name, p.item_bought, count(*) as cnt,
             row_number() over (partition by c.name order by count(*) desc) as seqnum
      from customers c join
           purchases p
           using (customer_id)
      group by c.name, p.item_bought
     ) cp
where seqnum = 1;

edited Feb 27 '20 at 02:12

answered Apr 26 '16 at 11:21

Gordon Linoff

1,242,037
58
646
786

Does Amazon Redshift allow to refer **`cnt`**: `select count(*) as cnt, row_number() over (order by cnt desc) as seqnum` at the same level? – Lukasz Szozda Apr 27 '16 at 11:57
@lad2025 . . . Arrrgh. Too much Google BigQuery lately. – Gordon Linoff Apr 28 '16 at 02:03
I believe this answer is missing `partition by name` to actually yield the result OP is after – Danny Feb 26 '20 at 23:03

Lukasz Szozda · Answer 2 · 2021-06-03T19:26:35.187

You could first COUNT each person purchases and then use RANK() windowed function:

SELECT name AS customer_name, item_bought AS item_bought_most_often
FROM(SELECT name,item_bought,RANK() OVER(PARTITION BY name ORDER BY cnt DESC) rnk
     FROM (SELECT c.name, p.item_bought, COUNT(*) AS cnt
           FROM customers c
           JOIN purchases p
             ON p.customer_id = c.customer_id
           GROUP BY c.name, p.item_bought) AS s1) AS s2
WHERE rnk = 1;

LiveDemo

Output:

╔═══════════════╦════════════════════════╗
║ customer_name ║ item_bought_most_often ║
╠═══════════════╬════════════════════════╣
║ alice         ║ glasses                ║
║ bob           ║ hat                    ║
║ tim           ║ shoes                  ║
║ zoe           ║ pencil                 ║
║ zoe           ║ book                   ║
╚═══════════════╩════════════════════════╝

Note:

RANK will handle multiple most common values.

EDIT:

As Lukas Eder mentioned it could be further simplified:

SELECT name AS customer_name, item_bought AS  item_bought_most_often
FROM(SELECT name,item_bought,
            RANK() OVER(PARTITION BY name ORDER BY COUNT(*) DESC) rnk
     FROM customers c
     JOIN purchases p
       ON p.customer_id = c.customer_id
     GROUP BY c.name, p.item_bought)  AS s2
WHERE rnk = 1;

db<>fiddle demo

I was doing something similar. I was really hoping for an aggregation function like `FIRST(my_column), MODE(my_column)`, or the ability to define it. But it doesn't exist. Another possibility would be something like `SPLIT_PART(LISTAGG(id, ','),',',1)`. or `udf_mode(LISTAGG, id, ',')`. udf_mode being a user defined function that calculates the mode based on string of values separated by a comma. But those are all hacky. — Benjamin Crouzier, Apr 27 '16 at 12:35
@pinouchon Based of [doc](http://docs.aws.amazon.com/redshift/latest/dg/user-defined-functions.html) *`You can create a custom user-defined >>scalar<< function (UDF)`*. I do not see user-defined aggregate functions doc like in Postgresql [CREATE AGGREGATE](http://www.postgresql.org/docs/current/static/sql-createaggregate.html) Idea with using `LISTAGG` and then using udf_mode could work. — Lukasz Szozda, Apr 27 '16 at 12:44
There's no need for the second level derived table, you can `ORDER BY COUNT(*) DESC` directly — Lukas Eder, Jun 03 '21 at 14:43
@LukasEder Thank you for your comment. I have extended the answer to include that version too. — Lukasz Szozda, Jun 03 '21 at 19:27

MODE aggregation function with redshift

2 Answers2