96

I have the following schema:

CREATE TABLE author (
    id   integer
  , name varchar(255)
);
CREATE TABLE book (
    id        integer
  , author_id integer
  , title     varchar(255)
  , rating    integer
);

And I want each author with its last book:

SELECT book.id, author.id, author.name, book.title as last_book
FROM author
JOIN book book ON book.author_id = author.id

GROUP BY author.id
ORDER BY book.id ASC

Apparently you can do that in mysql: Join two tables in MySQL, returning just one row from the second table.

But postgres gives this error:

ERROR: column "book.id" must appear in the GROUP BY clause or be used in an aggregate function: SELECT book.id, author.id, author.name, book.title as last_book FROM author JOIN book book ON book.author_id = author.id GROUP BY author.id ORDER BY book.id ASC

It's because:

When GROUP BY is present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions, since there would be more than one possible value to return for an ungrouped column.

How can I specify to postgres: "Give me only the last row, when ordered by joined_table.id, in the joined table ?"


Edit: With this data:

INSERT INTO author (id, name) VALUES
  (1, 'Bob')
, (2, 'David')
, (3, 'John');

INSERT INTO book (id, author_id, title, rating) VALUES
  (1, 1, '1st book from bob', 5)
, (2, 1, '2nd book from bob', 6)
, (3, 1, '3rd book from bob', 7)
, (4, 2, '1st book from David', 6)
, (5, 2, '2nd book from David', 6);

I should see:

book_id author_id name    last_book
3       1         "Bob"   "3rd book from bob"
5       2         "David" "2nd book from David"
Benjamin Crouzier
  • 40,265
  • 44
  • 171
  • 236

7 Answers7

125
select distinct on (author.id)
    book.id, author.id, author.name, book.title as last_book
from
    author
    inner join
    book on book.author_id = author.id
order by author.id, book.id desc

Check distinct on

SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.

With distinct on it is necessary to include the "distinct" columns in the order by. If that is not the order you want then you need to wrap the query and reorder

select 
    *
from (
    select distinct on (author.id)
        book.id, author.id, author.name, book.title as last_book
    from
        author
        inner join
        book on book.author_id = author.id
    order by author.id, book.id desc
) authors_with_first_book
order by authors_with_first_book.name

Another solution is to use a window function as in Lennart's answer. And another very generic one is this

select 
    book.id, author.id, author.name, book.title as last_book
from
    book
    inner join
    (
        select author.id as author_id, max(book.id) as book_id
        from
            author
            inner join
            book on author.id = book.author_id
        group by author.id
    ) s
    on s.book_id = book.id
    inner join
    author on book.author_id = author.id
Clodoaldo Neto
  • 118,695
  • 26
  • 233
  • 260
  • 2
    Does the job. `distinct on` is a bit postgres-specific. If there is another way, i'd be glad to know it. – Benjamin Crouzier Jun 04 '14 at 16:21
  • @pinouchon One other way is window functions as in Lennart's answer. – Clodoaldo Neto Jun 04 '14 at 16:23
  • Just wondering: performance wise, which is better: `distinct on` or `over partition by` for my use case ? – Benjamin Crouzier Jun 04 '14 at 16:27
  • 1
    @My guess goes to `distinct on`. But don't guess. Check both solutions with `explain analyze`. New solution in my answer – Clodoaldo Neto Jun 04 '14 at 16:31
  • 2
    `distinct on` is cool feature but please remember it causes to sorting which is good if can execute in memory. Once data set in subquery became larger the sorting involves disk operations (temp file will be written to disk to make sorting happen) – Zorg Apr 15 '16 at 18:00
  • 1
    @Zorg does the "window" solution avoids that? It's so strange this isn't more trivial... – Andy Hayden Oct 03 '19 at 05:40
  • 1
    Oh believe me... 'Strange this isn't more trivial' has been boggling me for months. I wish the postgres team (and I do love them dearly) would get on this issue rather than exotic xyz, this would touch so many more people than 'Allow invisible PROMPT 2' and the other stuff they work on. Just like Zorg says, you need to use distinct on an outer query, from subquery results. Your performance will instantly go to sh*t regardless of what indexes you have, as the entire subquery will be evaluated before your outer distinct and sort... – Kevin Parker Mar 29 '20 at 23:54
  • It works for me, Here I use that in sequelize ```attributes: [ Sequelize.literal('DISTINCT ON("device".id) 1'), 'id', 'mac_address', 'serial_number', 'device_type_id', 'createdAt']``` – Neel Rathod Dec 27 '21 at 08:45
26

I've done something similar for a chat system, where room holds the metadata and list contains the messages. I ended up using the Postgresql LATERAL JOIN which worked like a charm.

SELECT MR.id AS room_id, MR.created_at AS room_created, 
    lastmess.content as lastmessage_content, lastmess.datetime as lastmessage_when
FROM message.room MR
    LEFT JOIN LATERAL (
        SELECT content, datetime
        FROM message.list
        WHERE room_id = MR.id
        ORDER BY datetime DESC 
        LIMIT 1) lastmess ON true
ORDER BY lastmessage_when DESC NULLS LAST, MR.created_at DESC

For more info see https://heap.io/blog/engineering/postgresqls-powerful-new-join-type-lateral

Mirthe
  • 414
  • 4
  • 9
  • This answer seems easier to understand and more modern. Does anyone know if there are drawbacks to this over the other ones? – buzzb0x May 15 '23 at 16:09
  • This is what I'm looking for. Join lateral seems similar with subquery but it allows reference columns to "outer query" – wijayaerick Jun 15 '23 at 06:15
  • Link in heap.io has moved to [here](https://www.heap.io/blog/postgresqls-powerful-new-join-type-lateral). – nzn Aug 31 '23 at 12:30
12

This may look archaic and overly simple, but it does not depend on window functions, CTE's and aggregating subqueries. In most cases it is also the fastest.

SELECT bk.id, au.id, au.name, bk.title as last_book
FROM author au
JOIN book bk ON bk.author_id = au.id
WHERE NOT EXISTS (
    SELECT *
    FROM book nx
    WHERE nx.author_id = bk.author_id
    AND nx.book_id > bk.book_id
    )
ORDER BY book.id ASC
    ;
wildplasser
  • 43,142
  • 8
  • 66
  • 109
  • I find it hard to believe that NOT EXISTS is ever the fastest for anything except extremely trivial datasets... – Ajax Jun 07 '17 at 00:41
  • 3
    `EXISTS()` is older than sql92 style joins. Before outer joins existed, we had tobuild them by using`select ... from a where not exists(select ... from b where ...) union all select ... from a,b where ...` That is why developers put a lot of effort into implementing them. When available, indexes are used to implement the anti-join. [on most platforms] – wildplasser Jun 07 '17 at 12:50
  • 3
    The anti-join approach was faster than the other solutions suggested here for me. Always check with an `explain analyze`. This also surprised me @Ajax. – Mathieu Mitchell Jan 10 '20 at 16:02
8

You could add a rule into the join for specifying only one row. I had work for me.

Like this:

SELECT 
    book.id, 
    author.id, 
    author.name, 
    book.title as last_book
FROM author auth1
JOIN book book ON (book.author_id = auth1.id AND book.id = (select max(b.id) from book b where b.author_id = auth1))
GROUP BY auth1.id
ORDER BY book.id ASC

This way you get the data from the book with the higher ID. You could add "date" and make the same with the max(date).

Taochok
  • 359
  • 4
  • 7
5

Here is one way:

SELECT book_id, author_id, author_name, last_book
FROM (
    SELECT b.id as book_id
         , a.id as author_id
         , a.name as author_name
         , b.title as last_book
         , row_number() over (partition by a.id
                              order by b.id desc) as rn
    FROM author a
    JOIN book b 
        ON b.author_id = a.id
) last_books
WHERE rn = 1;
Benjamin Crouzier
  • 40,265
  • 44
  • 171
  • 236
Lennart - Slava Ukraini
  • 6,936
  • 1
  • 20
  • 32
  • I'm getting `subquery in FROM must have an alias` – Benjamin Crouzier Jun 04 '14 at 16:19
  • @pinouchon, replace the 2nd last line `)` with `) a` for example (which would give the subquery an alias of `a`) – OGHaza Jun 04 '14 at 16:23
  • @OGHaza Yes, thanks. I was wondering if I have to include the alias in the select. But just naming the alias is enough. – Benjamin Crouzier Jun 04 '14 at 16:24
  • @pinouchon, as long as there are no conflicts in column names you don't need to use the alias, although it's good practice to anyway at least for complex queries (e.g. if you join table A to table B, and both have a column called ID, you can't just SELECT ID - it'll throw an "ambiguous column name ID" error) – OGHaza Jun 04 '14 at 16:27
0

As a slight variation on @wildplasser's suggestion, which still works across implementations, you can use max rather than not exists. This reads better if you like short joins better than long where clauses

select * 
  from author au
  join (
    select max(id) as max_id, author_id
      from book bk
     group by author_id) as lb 
    on lb.author_id = au.id
  join bk 
    on bk.id = lb.max_id;

or, to give a name to the subquery, which clarifies things, go with WITH

with last_book as 
   (select max(id) as max_id, author_id
      from book bk
     group by author_id)

select * 
  from author au
  join last_book lb
    on au.id = lb.author_id
  join bk 
    on bk.id = lb.max_id;
jobermark
  • 248
  • 2
  • 11
  • 2
    Be really careful with using max... if the other columns you are selecting differ, you could wind up with a result that is a mix of two colums... not likely ever something you actually want. – Ajax Jun 07 '17 at 00:43
0
create temp table book_1 as (
SELECT
id
,title
,author_id
,row_number() OVER (PARTITION BY id) as rownum 
FROM
book)  distributed by ( id );

select author.id,b.id, author.id, author.name, b.title as last_book
from
    author

    left  join
   (select * from  book_1 where rownum = 1 ) b on b.author_id = author.id
order by author.id, b.id desc
סטנלי גרונן
  • 2,917
  • 23
  • 46
  • 68