-1

I have performing some queries using PostgreSQL SELECT DISTINCT ON syntax. I would like to have the query return the total number of rows alongside with every result row.

Assume I have a table my_table like the following:

CREATE TABLE my_table(
    id int,
    my_field text,
    id_reference bigint
);

I then have a couple of values:

 id | my_field | id_reference 
----+----------+--------------
  1 | a        |            1
  1 | b        |            2
  2 | a        |            3
  2 | c        |            4
  3 | x        |            5

Basically my_table contains some versioned data. The id_reference is a reference to a global version of the database. Every change to the database will increase the global version number and changes will always add new rows to the tables (instead of updating/deleting values) and they will insert the new version number.

My goal is to perform a query that will only retrieve the latest values in the table, alongside with the total number of rows.

For example, in the above case I would like to retrieve the following output:

| total | id | my_field | id_reference |
+-------+----+----------+--------------+
| 3     | 1  | b        |  2           |
+-------+----+----------+--------------+
| 3     | 2  | c        |  4           |
+-------+----+----------+--------------+
| 3     | 3  | x        |  5           |
+-------+----+----------+--------------+

My attemp is the following:

select distinct on (id)
    count(*) over () as total,
    *
from my_table
order by id, id_reference desc

This returns almost the correct output, except that total is the number of rows in my_table instead of being the number of rows of the resulting query:

 total | id | my_field | id_reference 
-------+----+----------+--------------
     5 |  1 | b        |            2
     5 |  2 | c        |            4
     5 |  3 | x        |            5
(3 rows)

As you can see it has 5 instead of the expected 3.

I can fix this by using a subquery and count as an aggregate function:

with my_values as (
  select distinct on (id)
    *
  from my_table
  order by id, id_reference desc
)
select count(*) over (), * from my_values

Which produces my expected output.

My question: is there a way to avoid using this subquery and have something similar to count(*) over () return the result I want?

Giacomo Alzetta
  • 2,431
  • 6
  • 17
  • I"m not sure you can avoid the subquery in this case. You need to have an intermediate result somewhere in order to be able to use `COUNT(*) OVER ()` to get the row count from the actual query. – Tim Biegeleisen Jan 09 '18 at 13:25

1 Answers1

1

You are looking at my_table 3 ways:

  1. to find the latest id_reference for each id
  2. to find my_field for the latest id_reference for each id
  3. to count the distinct number of ids in the table

I therefore prefer this solution:

select
    c.id_count as total,
    a.id,
    a.my_field,
    b.max_id_reference
from
    my_table a
    join
    (
        select 
            id,
            max(id_reference) as max_id_reference
        from 
            my_table
        group by
            id
    ) b 
    on
        a.id = b.id and
        a.id_reference = b.max_id_reference
    join
    (
        select
            count(distinct id) as id_count
        from
            my_table
    ) c
    on true;

This is a bit longer (especially the long thin way I write SQL) but it makes it clear what is happening. If you come back to it in a few months time (somebody usually does) then it will take less time to understand what is going on.

The "on true" at the end is a deliberate cartesian product because there can only ever be exactly one result from the subquery "c" and you do want a cartesian product with that.

There is nothing necessarily wrong with subqueries.

Ron Ballard
  • 693
  • 6
  • 8