Remove duplicates after joining three different tables

Question

I'm working now on SQL query for Polarion widget, where I want get items with unresolved comments. Unfortunately, I have a problem with joining three tables where one has optional empty/null. In Polarion is a few different tables to contain comments, most of them has a resolved status. I'm using WORKITEM, COMMENT, MODULECOMMENT tables, because sometimes workitem has comment here, sometimes in another table.

Difference is one: when workitem has "normal" comment, then there is always FK key in COMMENT table, but not each workitem has a FK MODULE key (table structure).

In default widget, I can use SQL query with a one big limit: I can't replace whole select and from, because there is a static line: SELECT WORKITEM.C_URI and FROM WORKITEM, so I can only add something (I tried DISTINCT ON, but how can you see, I can't replace that line)

Polarion has another problem too, each document has headings, texts etc. and each of this element is separate workitem with the same FK_URI_MODULE (and documents use MODULECOMMENT table).

I want remove duplicates from MODULECOMMENT (ignore situation where workitem.FK_URI_MODULE is empty/null, because it's normal state).

The best what I created, It's here (in this situation I replace is null to is not null, because It's return less items):

select workitem.c_uri, workitem.FK_URI_MODULE
from workitem 
left join COMMENT on COMMENT.FK_URI_WORKITEM = WORKITEM.C_URI 
left join MODULECOMMENT on MODULECOMMENT.FK_URI_MODULE = WORKITEM.FK_URI_MODULE 
where true
and workitem.fk_uri_project = 231 
and (comment.c_resolved is not null 
or modulecomment.c_resolved is not null);

and my results:

c_uri  | fk_uri_module
--------+---------------
   7952 |          7940
   7949 |          7940
   7953 |          7940
   7964 |          7940
 124141 |        124138
    609 |
   1347 |
    609 |

but I want get something like that:

c_uri  | fk_uri_module
--------+---------------
   7952 |          7940
 124141 |        124138
    609 |
   1347 |
    609 |

The last three records are also duplicates regarding the NULL value of the right column. These should be really ignored? — S-Man, Feb 25 '21 at 08:49

score 1 · Answer 1 · answered Feb 25 '21 at 10:15

Don't generate the duplicates to begin with by not joining, but using an EXISTS condition:

select workitem.c_uri, workitem.fk_uri_module
from workitem 
where exists (select * 
              from comment 
                left join modulecomment on modulecomment.fk_uri_module = workitem.fk_uri_module 
              where comment.fk_uri_workitem = workitem.c_uri
                and (comment.c_resolved is not null 
                     or modulecomment.c_resolved is not null))
where workitem.fk_uri_project = 231

I have to change second where to "and" (before workitem.fk_uri_project = 231) and it doesn't work because there is any modulecomment records. — Paweł Jamroziak, Feb 25 '21 at 10:48

S-Man · Answer 2 · 2021-02-25T11:14:40.277

You can use DISTINCT ON for this, which returns only the first record of a group:

demos:db<>fiddle

select DISTINCT ON (workitem.FK_URI_MODULE) 

    workitem.c_uri, workitem.FK_URI_MODULE
from workitem 
left join COMMENT on COMMENT.FK_URI_WORKITEM = WORKITEM.C_URI 
left join MODULECOMMENT on MODULECOMMENT.FK_URI_MODULE = WORKITEM.FK_URI_MODULE 
where true
    and workitem.fk_uri_project = 231 
    and (comment.c_resolved is not null 
    or modulecomment.c_resolved is not null)

The interesting part is the NULL records. The query above returns only one of the NULL records, because DISTINCT ON also removes duplicates for NULL.

So, if you want to keep all NULL values, we need to treat the NULL records differently:

SELECT DISTINCT ON (fk_uri_module)  -- 1
    *
FROM <your_query>
WHERE fk_uri_module IS NOT NULL

UNION ALL

SELECT                              -- 2
    *
FROM <your_query>
WHERE fk_uri_module IS NULL

Apply DISTINCT ON on all records with non-NULL values
UNION ALL records with NULL afterwards

To avoid executing your entire query twice, you can move them into a CTE (WITH clause) and reference it later:

WITH my_result AS (
    -- <your_query>
)
SELECT DISTINCT ON (fk_uri_module)
    *
FROM my_result 
WHERE fk_uri_module IS NOT NULL

UNION ALL

SELECT
    *
FROM my_result 
WHERE fk_uri_module IS NULL

Edit:

demo:db<>fiddle

If you cannot use DISTINCT ON, a valid alternative for this is the row_number() window function:

SELECT
    *
FROM (
    select
        workitem.c_uri, workitem.FK_URI_MODULE,
    
        row_number() OVER (PARTITION BY workitem.FK_URI_MODULE)

    from workitem 
    left join COMMENT on COMMENT.FK_URI_WORKITEM = WORKITEM.C_URI 
    left join MODULECOMMENT on MODULECOMMENT.FK_URI_MODULE = WORKITEM.FK_URI_MODULE 
    where true
        and workitem.fk_uri_project = 231 
        and (comment.c_resolved is not null 
        or modulecomment.c_resolved is not null)
) s
WHERE row_number = 1

Added an alternative approach, but this only works if you are allowed to use subqueries. If that's not possible too, then your only chance is to avoid the duplicate joins before. Later remove is not possible with a way you like. — S-Man, Feb 25 '21 at 11:14
However, even if a solution does not fit completely your requirements, please don't forget to UPVOTE every answer which is helpful in a way (that can be: give a good hint, point out an approach or even show that your goal is not achievable). This honors the time and effort the repliers invested into your specific problem, it's their only reward and it motivates them to do so. :) We'd appreciate! :) — S-Man, Feb 25 '21 at 11:21

Remove duplicates after joining three different tables

2 Answers2