2

I have a simple two column table. For example's sake we can use the following to build the data:

CREATE TABLE Duplicates
  (assignmentid varchar(5), questionid varchar(5));

INSERT INTO Duplicates
  (assignmentid, questionid)
VALUES
  ('aaaaa', '11111'),
  ('aaaaa', '22222'),
  ('bbbbb', '22222'),
  ('bbbbb', '33333'),
  ('bbbbb', '33333');

There are two rows that are identical. There is also a question that appears on multiple assignments. The latter is a valid scenario, and I am trying to get a query of all the questions that are part of multiple assignments. So my desired output is:

  aaaaa, 22222
  bbbbb, 22222

I was able to get this results with this:

SELECT main.questionid, sub.assignmentid 
FROM (
   SELECT questionid, count(assignmentid) AS AssignmentCount 
   FROM ( 
      SELECT DISTINCT questionid, assignmentid 
      FROM Duplicates
   ) sub 
   GROUP BY questionid
   HAVING AssignmentCount > 1
) main
INNER JOIN (
     SELECT DISTINCT questionid, assignmentid 
     FROM Duplicates
) sub ON main.questionid = sub.questionid;

As you can see the DISTINCT subquery is being repeated twice. I could avoid this by using the WITH command, but my understanding is that this does not necessarily mean that the subquery will be executed only once. So now I am here in StackOverflow, to ask if anyone knows a more efficient way to run this query.

Kumar
  • 23
  • 1
  • 3
  • A CTE ("with") query **will** be executed once. –  Dec 30 '16 at 19:56
  • @a_horse_with_no_name Please expand on the CTE comment. I thought a CTE should be treated as a view (versus a temp table), which means every time a view is used, a call is repeated to the underlying table. Perhaps I misunderstood. Here is a link to another stackoverflow page I referred to: – Kumar Jan 17 '17 at 15:59

2 Answers2

0

Just use window functions. One method is to compare the count of answers to the count of distinct answers:

select distinct answerid, questionid
from (select d.*,
             count(distinct answerid) over (partition by questionid) as cntd,
             count(*) over (partition by questionid) as cnt
      from duplicates d
     ) d
where cntd <> cnt;

EDIT:

You can do this without count(distinct) but it requires one extra subquery:

select distinct answerid, questionid
from (select d.*,
             count((seqnum = 1)::int) over (partition by questionid) as cntd,
             count(*) over (partition by questionid) as cnt
      from (select d.*,
                   row_number() over (partition by questionid, answerid order by questionid) as seqnum
            from duplicates d
           ) d
     ) d
where cntd <> cnt;

This uses the row number for the distinct calculation.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Unfortunately Postgres does not support `distinct` in a window functions –  Dec 30 '16 at 19:57
0

You can simplify that to:

select *
from duplicates
where questionid in (select questionid
                     from duplicates
                     group by questionid
                     having count(distinct assignmentid) > 1);

The sub-query returns all questionids that are assigned to more then one assignmentid.