how to avoid duplicate on Joining two tables

Question

Student Table

 SID    Name
 1      A
 2      B
 3      C

 Marks Table

 id mark    subject 
 1  50  physics
 2  40  biology
 1  50  chemistry
 3  30  mathematics



SELECT distinct(std.id),std.name,m.mark, row_number() over() as rownum FROM 

student std JOIN marks m ON std.id=m.id AND m.mark=50

This result is 2 times A even after using disticnt . My expected result will have only one A. if i remove row_number() over() as rownum its working fine. Why this is happening ? how to resolve. AM using DB2!!

What is the point of joining Student and Marks if you don't want A's marks in both physics and chemistry? — sceaj, Nov 11 '11 at 20:51

Charles Bretana · Accepted Answer · 2017-07-21T12:39:49.097

8

There are two rows in marks Table with id = 1 and mark = 50.. So you will get two rows in the output for each row in student table... If you only want one, you have to do a group By

 SELECT std.id, std.name, m.mark, row_number() 
  over() as rownum 
 FROM student std 
    JOIN marks m 
       ON m.id=std.id AND m.mark=50 
 Group By std.id, std.name, m.mark

edited Jul 21 '17 at 12:39

answered Nov 11 '11 at 20:51

Charles Bretana

143,358
22
150
216

1

ALWAYS put the join predicates in the join. Where clause predicates are not evaluated until after the entire result set has been generated, so unneeded rows are carried along throughout the processing, and in certain outer join scenarios, putting predicates in where clause will generate incorrect results. Finally, putting join predicates in the join puts them near the tables they are about, instead of all together at the end, which adds clarity to the query. – Charles Bretana Jul 21 '17 at 12:39
ALSO, sometimes you want to join to the same table more than once, with DIFFERENT predicates for each join. How are you going to do that in a Where clause? – Charles Bretana Sep 07 '17 at 17:46
@CharlesBretana Every major relational database is smart enough to put filtering conditions before `JOIN` conditions when actually evaluating the query; they are *quite* good at reducing the number of operations they have to perform (aka optimizing). Generally speaking, whether you put something like `m.mark = 50` in the `WHERE` or the `ON` clause doesn't even affect the query plan. As such, the aesthetic improvement is the only real consideration (for inner joins, at least), and putting such conditions in the `WHERE` clause is vastly more intuitive. – jpmc26 Jul 05 '18 at 07:40
NO, they're not, because the results of putting the predicate in each of these two places is (or can be) DIFFERENT, and the optimizer cannot be designed to read the mind of the query author. In one case the predicate is evaluated before the join, and in the other case it is evaluated after the join. – Charles Bretana Jul 06 '18 at 06:49
@CharlesBretana The results can only differ for non-`INNER` joins, which the query in your answer is not one of. Go try it with both places with your favorite DB and check whether the query plans are different. They won't be. Re-arranging steps like that is a *major* part of optimization. And in fact, for non-`INNER` joins, putting the filter in the `ON` clause makes the query slower since it has to include more rows from the unfiltered table. Yes, that's a different result set, but it's also rarely needed. It's much more common to want to filter out anything that doesn't match the filter. – jpmc26 Jul 10 '18 at 00:52
@jpmc26, OF COURSE. The devil is in the details, but promulgating the attitude that these two approaches produce identical results, WITHOUT discussing the details, (like the fact that they are not the same for outer (I have never heard outer joins referred to as "non-inner"!), joins, is at best misleading and wrong, and at worst dangerous. And do not profess to have any knowledge about the frequency of use or utility of outer joins. – Charles Bretana Jul 11 '18 at 22:51
You're the one suggesting that table filters against a constant should be categorically placed in the `ON` clause for performance reasons: "Where clause predicates are not evaluated until after the entire result set has been generated" This is 100% categorically wrong and easily demonstrated to be so. I explicitly pointed out that I was talking about inner joins with regard to aesthetics being the primary concern. Obviously, the semantics take precedence for outer joins, but your advice to place it in the `ON` clause for performance reasons is still wrong in that case. – jpmc26 Jul 11 '18 at 23:08

score 5 · Answer 2 · answered Nov 11 '11 at 21:08

5

Now that you've clarified your question as:

I want to find all students with a mark of 50 in at least one subject. I would use the query:

SELECT student.id, '50'
FROM student 
WHERE EXISTS (SELECT 1 FROM marks WHERE marks.id = student.id AND marks.mark = 50)

This also gives you flexibility to change the criteria, e.g. at least one mark of 50 or less.

answered Nov 11 '11 at 21:08

sceaj

1,573
3
12
24

1

+1 EXISTS (semi-join) is what you really want to do. You don't need a full join because you don't need all that extra data. This query is more efficient vs. doing a the extra work of a full join and then doing even more extra work to trim the dataset down to only what you wanted in the first place. – Code Magician Nov 11 '11 at 21:11
+1 for using WHERE EXISTS, but you forgot about his rownum column. – orbfish Nov 11 '11 at 21:12
@orbfish I really don't understand what the OP is trying to do with `row_number() over ()`. @zod Perhaps you can clarify? – sceaj Nov 11 '11 at 21:14
row_number() over () is used for pagination , just like LIMIT in mysql – zod Nov 11 '11 at 21:17
Reading up on `row_number() over ()` it just provides a sequential row number to each row of the result set so it certainly could have been included. The FROM and WHERE are the important aspects of the solution, but thanks for nudging me to learn something new. – sceaj Nov 11 '11 at 21:22
if i want to show the subject name , this will not work.. is it? – zod Nov 14 '11 at 15:28
Then we're back to what to show when there are more than 1 mark of 50? It seems that the requirements are not clear. – sceaj Nov 14 '11 at 17:45

score 1 · Answer 3 · answered Nov 11 '11 at 21:11

1

Similar to Charles answer, but you always want to put the predicate (mark=50) in the WHERE clause, so you're filtering before joining. If this is just homework it might not matter but you'll want to remember this if you ever hit any real data.

SELECT std.sid,
       std.name,
       m.mark,
       row_number() over() AS rownum 
 FROM student std 
      JOIN marks m 
        ON std.sid=m.id
WHERE m.mark=50
GROUP BY std.sid, std.name, m.mark

answered Nov 11 '11 at 21:11

orbfish

7,381
14
58
75

I find most query optimizers can usually figure out the best time to apply filters regardless of clause placement. Is this a DB2 quirk? – Code Magician Nov 11 '11 at 22:00
@M_M - No - DB2 (at least on the iSeries) appears to generate the same explain plans, and seems to regard the two queries as equivalent (using the same access paths); at least for examples as simple as this. @All - Placing a condition in the `WHERE` clause versus the `JOIN` clause will _not_ always return the same results (this is mostly relevant when using `LEFT` and/or `EXCEPTION` joins), so conditionals should be placed where they will generate correct results, not for performance reasons (the DB2 optimizer is pretty good, too). – Clockwork-Muse Nov 12 '11 at 00:32
@M_M - I'm familiar with it in Oracle. Sometimes it figures it out, but sometimes it figures you really mean what you say with the SQL, and winds up taking forever. – orbfish Nov 15 '11 at 00:08

how to avoid duplicate on Joining two tables

3 Answers3

Linked