Is there a way to ensure WHERE clause happens after DISTINCT?

Question

Imagine you have a table comments in your database.

The comment table has the columns, id, text, show, comment_id_no.

If a user enters a comment, it inserts a row into the database

| id |  comment_id_no | text | show | inserted_at |
| -- | -------------- | ---- | ---- | ----------- |
| 1  | 1              | hi   | true | 1/1/2000    |

If a user wants to update that comment it inserts a new row into the db

| id |  comment_id_no | text | show | inserted_at |
| -- | -------------- | ---- | ---- | ----------- |
| 1  | 1              | hi   | true | 1/1/2000    |
| 2  | 1              | hey  | true | 1/1/2001    |

Notice it keeps the same comment_id_no. This is so we will be able to see the history of a comment.

Now the user decides that they no longer want to display their comment

| id |  comment_id_no | text | show  | inserted_at |
| -- | -------------- | ---- | ----- | ----------- |
| 1  | 1              | hi   | true  | 1/1/2000    |
| 2  | 1              | hey  | true  | 1/1/2001    |
| 3  | 1              | hey  | false | 1/1/2002    |

This hides the comment from the end users.

Now a second comment is made (not an update of the first)

| id |  comment_id_no | text | show  | inserted_at |
| -- | -------------- | ---- | ----- | ----------- |
| 1  | 1              | hi   | true  | 1/1/2000    |
| 2  | 1              | hey  | true  | 1/1/2001    |
| 3  | 1              | hey  | false | 1/1/2002    |
| 4  | 2              | new  | true  | 1/1/2003    |

What I would like to be able to do is select all the latest versions of unique commend_id_no, where show is equal to true. However, I do not want the query to return id=2.

Steps the query needs to take...

select all the most recent, distinct comment_id_nos. (should return id=3 and id=4)
select where show = true (should only return id=4)

Note: I am actually writing this query in elixir using ecto and would like to be able to do this without using the subquery function. If anyone can answer this in sql I can convert the answer myself. If anyone knows how to answer this in elixir then also feel free to answer.

I would suggest to not pollute working tables with history stuff. Instead of that use a separate auditory/history table (and preferably in a separate schema) and implement a simple trigger (which is pretty simple in postgres) to log each insert, update (and even delete if you allow it) operations into it. This will not only make things simpler to you, but it will also avoid so many issues... I don't post an example now because it doesn't answer your actual question. But if you are interested in it I could post an example. — bitifet, Feb 14 '19 at 12:59
@bitifet Thanks for your comment. If you feel that there is a better solution to this issue as a whole then please feel free to post an answer. (Or alternatively, as you said yourself because it is not a direct answer, you could link to an example? Your choice) — RobStallion, Feb 14 '19 at 13:04
@bitifet As another solution, you could do double versioning. For each record, store the date inserted, and also the date an updated record over-wrote it. Writing queries is much easier this way, since you just need to select the record where `inserted_at <= your_date AND (removed_at > your_date OR removed_at IS NULL)`. On the flip side, inserting/deleting/updating records becomes harder. — Josh Eller, Feb 14 '19 at 17:34
You are not clear re having distinct before where. Write complete sentences to actually say what you want. How can it matter whether distinct is before where?--distinct keeps distinct *rows*. "distinct comment_id_nos" doesn't make sense unless you drop other columns. Maybe you are talking about grouping and/or selecting distinct in a subquery. But it's not clear. You address lack of clarity in comments but clarifications should be edited into posts. PS This involves a faq that even has a tag, [tag:greatest-n-per-group]. From which you want rows with show=true. — philipxy, Feb 14 '19 at 23:40
@JoshEller: I don't thing it would be a good solution: It continue being more complicated than needed and even inefficient (see my answer where I explain why). — bitifet, Feb 15 '19 at 08:14
@bitifet It depends on what you're actually using this for. If you need to write queries to look at data "as of" a specific date, double versioning enormously simplifies things. If you only care about the current version of data for most cases, and only want to look at past data in special cases, your suggestion works better. — Josh Eller, Feb 15 '19 at 14:56

GarethD · Accepted Answer · 2019-02-14T13:21:01.153

You can do this without using a subquery using LEFT JOIN:

SELECT  c.id, c.comment_id_no, c.text, c.show, c.inserted_at
FROM    Comments AS c
        LEFT JOIN Comments AS c2
            ON c2.comment_id_no = c.comment_id_no
            AND c2.inserted_at > c.inserted_at
WHERE   c2.id IS NULL
AND     c.show = 'true';

I think all other approaches will require a subquery of some sort, this would usually be done with a ranking function:

SELECT  c.id, c.comment_id_no, c.text, c.show, c.inserted_at
FROM    (   SELECT  c.id, 
                    c.comment_id_no, 
                    c.text, 
                    c.show, 
                    c.inserted_at,
                    ROW_NUMBER() OVER(PARTITION BY c.comment_id_no 
                                      ORDER BY c.inserted_at DESC) AS RowNumber
            FROM    Comments AS c
        ) AS c
WHERE   c.RowNumber = 1
AND     c.show = 'true';

Since you have tagged with Postgresql you could also make use of DISTINCT ON ():

SELECT  *
FROM    (   SELECT  DISTINCT ON (c.comment_id_no) 
                    c.id, c.comment_id_no, c.text, c.show, c.inserted_at
            FROM    Comments AS c 
            ORDER By c.comment_id_no, inserted_at DESC
        ) x
WHERE   show = 'true';

Examples on DB<>Fiddle

Gordon Linoff · Answer 2 · 2019-02-14T13:21:40.720

5

I think you want:

select c.*
from comments c
where c.inserted_at = (select max(c2.inserted_at)
                       from comments c2
                       where c2.comment_id_no = c.comment_id_no
                      ) and
      c.show = 'true';

I don't understand what this has to do with select distinct. You simply want the last version of a comment, and then to check if you can show that.

EDIT:

In Postgres, I would do:

select c.*
from (select distinct on (comment_id_no) c.*
      from comments c
      order by c.comment_id_no, c.inserted_at desc
     ) c
where c.show

distinct on usually has pretty good performance characteristics.

edited Feb 14 '19 at 13:21

answered Feb 14 '19 at 12:45

Gordon Linoff

1,242,037
58
646
786

The reason that I was suggesting `select distinct` is because I wanted a query that gets the latest version of all comments where show = true. I believe that your answer will currently just get the last comment if show = true (please correct me if I am wrong) I'll update my question to make it more clear. – RobStallion Feb 14 '19 at 12:52
1

@RobStallion . . . `SELECT DISTINCT` doesn't return the latest version of *anything*. It simply guarantees that the result set has no duplicate rows. – Gordon Linoff Feb 14 '19 at 13:18
@Gordan Linoff. I am aware of that. I guess I wasn't clear enough sorry. My thought process was to order the entries by date, then select by distinct. Hence _"select all the **most recent**, distinct `comment_id_no`..."_ in my question. I thought that this would make it clear that I first wanted to order by date, then select the distinct values. This would be how I would return the _latest version_ of a `comment_id_no` without duplicates. Your answer seems to capture what I meant. Hopefully this is clear now. Thanks for your answer and comments – RobStallion Feb 14 '19 at 14:13
@RobStallion . . . That is, ironically, what `distinct on` does. It is specific to Postgres, but that is the database you are using. – Gordon Linoff Feb 14 '19 at 21:11

score 4 · Answer 3 · edited Feb 20 '19 at 15:36

As I told in comments I don't advice to pollute data tables with history/auditory stuff.

And no: "double versioning" suggested by @Josh_Eller in his comment isn't a good solution too: Not only for complicating queries unnecessarily but also for being much more expensive in terms of processing and tablespace fragmentation.

Take in mind that UPDATE operations never update anything. They instead write a whole new version of the row and mark the old one as deleted. That's why vacuum processes are needed to defragment tablespaces in order to recover that space.

In any case, apart of suboptimal, that approach forces you to implement more complex queries to read and write data while in fact, I suppose most of the times you will only need to select, insert, update or even delete single row and only eventually, look its history up.

So the best solution (IMHO) is to simply implement the schema you actually need for your main task and implement the auditory aside in a separate table and maintained by a trigger.

This would be much more:

Robust and Simple: Because you focus on single thing every time (Single Responsibility and KISS principles).
Fast: Auditory operations can be performed in an after trigger so every time you perform an INSERT, UPDATE, or DELETE any possible lock within the transaction is yet freed because the database engine knows that its outcome won't change.
Efficient: I.e. an update will, of course, insert a new row and mark the old one as deleted. But this will be done at a low level by the database engine and, more than that: your auditory data will be fully unfragmented (because you only write there: never update). So the overall fragmentation would be always much less.

That being said, how to implement it?

Suppose this simple schema:

create table comments (
    text text,
    mtime timestamp not null default now(),
    id serial primary key
);

create table comments_audit ( -- Or audit.comments if using separate schema
    text text,
    mtime timestamp not null,
    id integer,
    rev integer not null,
    primary key (id, rev)
);

...and then this function and trigger:

create or replace function fn_comments_audit()
returns trigger
language plpgsql
security definer
    -- This allows you to restrict permissions to the auditory table
    -- because the function will be executed by the user who defined
    -- it instead of whom executed the statement which triggered it.
as $$
DECLARE
BEGIN

    if TG_OP = 'DELETE' then
        raise exception 'FATAL: Deletion is not allowed for %', TG_TABLE_NAME;
        -- If you want to allow deletion there are a few more decisions to take...
        -- So here I block it for the sake of simplicity ;-)
    end if;

    insert into comments_audit (
        text
        , mtime
        , id
        , rev
    ) values (
        NEW.text
        , NEW.mtime
        , NEW.id
        , coalesce (
            (select max(rev) + 1 from comments_audit where id = new.ID)
            , 0
        )
    );

    return NULL;

END;
$$;

create trigger tg_comments_audit
    after insert or update or delete
    on public.comments
    for each row
    execute procedure fn_comments_audit()
;

And that's all.

Notice that in this approach you will have always your current comments data in comments_audit. You could have instead used the OLD register and only define the trigger in the UPDATE (and DELETE) operations to avoid it.

But I prefer this approach not only because it gives us an extra redundancy (an accidental deletion -in case it were allowed or the trigger where accidentally disabled- on the master table, then we would be able to recover all data from the auditory one) but also because it simplifies (and optimises) querying the history when it's needed.

Now you only need to insert, update or select (or even delete if you develop a little more this schema, i.e. by inserting a row with nulls...) in a fully transparent manner just like if it weren't any auditory system. And, when you need that data, you only need to query the auditory table instead.

NOTE: Additionally you could want to include a creation timestamp (ctime). In this case it would be interesting to prevent it of being modified in a BEFORE trigger so I omitted it (for the sake of simplicity again) because you can already guess it from the mtimes in the auditory table (even if you are going to use it in your application it would be very advisable to add it).

score 2 · Answer 4 · edited Feb 14 '19 at 13:47

2

If you are running Postgres 8.4 or higher, ROW_NUMBER() is the most efficient solution :

SELECT *
FROM (
    SELECT c.*, ROW_NUMBER() OVER(PARTITION BY comment_id_no ORDER BY inserted_at DESC) rn
    FROM comments c
    WHERE c.show = 'true'
) x WHERE rn = 1

Else, this could also be achieved using a WHERE NOT EXISTS condition, that ensures that you are showing the latest comment :

SELECT c.*
FROM comments c
WHERE 
    c.show = 'true '
    AND NOT EXISTS (
        SELECT 1 
        FROM comments c1 
        WHERE c1.comment_id_no = c.comment_id_no AND c1.inserted_at > c.inserted_at
    )

edited Feb 14 '19 at 13:47

answered Feb 14 '19 at 12:56

GMB

216,147
25
84
135

Actually on Postgres, `distinct on (..)` is usually more efficient than window functions. – Feb 14 '19 at 13:49
@a_horse_with_no_name : thanks for information (and also for the edit) ! – GMB Feb 14 '19 at 13:57

forpas · Answer 5 · 2019-02-14T14:16:14.837

1

You have to use group by to get the latest ids and the join to the comments table to filter out the rows where show = false:

select c.* 
from comments c inner join (
  select comment_id_no, max(id) maxid
  from comments
  group by comment_id_no 
) g on g.maxid = c.id
where c.show = 'true'

I assume that the column id is unique and autoincrement in comments table.
See the demo

edited Feb 14 '19 at 14:16

answered Feb 14 '19 at 12:58

forpas

160,666
10
38
76

Is there a way to ensure WHERE clause happens after DISTINCT?

5 Answers5