SQL Having ,group by, that has at least 2 son , first son named X but the second son is Y

Question

Suppose I have a tow SQL tables:

Fathers 

    id  |  name    
    -------------  
    1   | Clinton
    2   | Gates
    3   | Bush
    4   | Clinton

Sons 

    id | fatherid | name
    ---------------------
    1  |  1       | bill
    2  |  1       | suzy
    3  |  2       | mera
    4  |  2       | suzy 
    5  |  3       | bill
    6  |  3       | rose
    7  |  4       | bill

I would like to write a query that returns the Fathers who have (at least two sons) one of sons named bill and the other son is not suzy . How do I write this SQL query?

please show what you have tried so far, and what results it gave. — mc110, Jun 15 '14 at 08:05
how do you determine which one is first ? Base on the ID ? i don't think so, first you should add one more column to store the childen order, maybe a birthday column — NeedAnswers, Jun 15 '14 at 08:10
Any particular RDBMS? MySQL? Oracle? Or is this a homework question that doesn't specify? — Joachim Isaksson, Jun 15 '14 at 08:11
@user3663203: So the sample data given will not return any result, right? — Raging Bull, Jun 15 '14 at 08:14
What if a person has 3 sons, 'Bill', 'Bob' and 'Suzy'? Should he be returned or not? — Joachim Isaksson, Jun 15 '14 at 08:26
@user3663203 Yes, but the example I gave has 3 sons (ie at least 2 sons), one of which is called Suzy. Should he be returned or not? — Joachim Isaksson, Jun 15 '14 at 08:30

Joachim Isaksson · Accepted Answer · 2014-06-15T08:24:55.000

There are many ways to write this query, but most performant ones are RDBMS specific. This is just a straight forward one;

SELECT f.* FROM fathers f
WHERE id IN 
  (SELECT fatherid FROM sons WHERE name='bill')
AND id IN 
  (SELECT fatherid FROM sons WHERE name<>'bill' AND name<>'suzy')
AND id NOT IN 
  (SELECT fatherid FROM sons WHERE name='suzy')

The first subquery finds all fathers that have a son named Bill.
The second subquery finds all fathers that have a son not named Bill or Suzy.
The third subquery finds all fathers that have a son named Suzy.

The fathers that end up in the first two groups and not the third one are the fathers you're looking for.

An SQLfiddle to test with.

Raging Bull · Answer 2 · 2014-06-15T08:43:16.793

1

You can simply do this:

SELECT F.*
FROM Fathers F LEFT JOIN
     Sons S ON F.id=S.fatherid
WHERE F.id IN (SELECT fatherid FROM Sons WHERE name='bill')
AND F.id NOT IN (SELECT fatherid FROM Sons WHERE name='suzy')
GROUP BY F.id
HAVING COUNT(S.id)>=2

Result:

ID    NAME
----------
3     Bush

Result in SQL Fiddle.

Explanation:

Here we are selecting the details of father who has atleast 2 children (using the HAVING clause). And the conditions are included in the WHERE clause.

edited Jun 15 '14 at 08:43

answered Jun 15 '14 at 08:20

Raging Bull

18,593
13
50
55

1

@JoachimIsaksson: Edited my answer. And updated the fiddle. In the updated fiddle, it won't select `Clinton` as the father of suzy. – Raging Bull Jun 15 '14 at 08:44

score 1 · Answer 3 · answered Jun 15 '14 at 13:17

You ask a very basic question and you got downvoted for it. When looking at the answers, they are good and factual correct. But I think the answers are unnecessary complex, because such a basic question needs a answer on very basic level. Otherwise the answer won't help you very much, when you face a similar problem.

So let's start on how to tackle your problem on a basic level.

Note: You can find all the code in a live fiddle.

A basic approach

For a very basic approach, we will first solve the following, easy sub-problems:

Which fathers have two or more sons
Which fathers have a son named Bill
Which fathers have a son named Suzy

We will then combine these partial results using set theory: We look for all fathers that are in the first and second group but not in the third. So we are looking for the following set:

(Problem 1) AND (Problem 2) MINUS (Problem 3).

The nice thing about this approach is, that we get a query for each sub-problem that can be run on itself. So it is very easy to follow the approach with pen and paper and to debug the complete solution. Still we get a very efficient query, that can be nicely optimized by the database engine.

Fathers with two or more sons

Here we need only to access the sons. We take all the sons, group the ones with the same father together and then count, how many are in each group.

SELECT s.fatherid FROM 
  Sons s
GROUP BY s.fatherid
HAVING COUNT(s.fatherid) >= 2;

You can run this query and easily check the result. In this case you will get the fathers with id 1,2 and 3.

Fathers with son Bill

The second query is very straight forward. We take all the sons, filter them by name and then get their father:

SELECT s.fatherid FROM
  Sons s
WHERE
 s.name = 'bill';

You will get the fathers with the id 1,3 and 4.

Fathers with son Suzi

This is similar to the previous approach. You should not have any trouble with this query>

SELECT s.fatherid FROM
  Sons s
WHERE
 s.name = 'bill';

You will get the fathers with the id 1 and 2.

Combining the queries

Now let's combine the previous queries. For this you should know the SQL operators INTERSECT and EXCEPT (sometimes called MINUS). With these operators, you can combine the queries we built before to one query:

SELECT s.fatherid FROM 
  Sons s
GROUP BY s.fatherid
HAVING COUNT(s.fatherid) >= 2
INTERSECT
SELECT s.fatherid FROM
  Sons s
WHERE
 s.name = 'bill'
EXCEPT
SELECT s.fatherid FROM
  Sons s
WHERE
 s.name = 'suzy'
;

So you get the following result:

{1,2,3} INTERSECT {1,3,4} MINUS {1,2} = {3}

Hint: In the fiddle you can mark the part of the query you want to execute and press "Run SQL". This will just run the partial query you have marked and return the according partial result.

Attention: Things get a little bit dirty, when you don't have real sets. This is the case, when one of the sub-queries can return the same father-id twice (or even more often). In this case you could write SELECT DISTINCT to clean up the intermediate result. If you know for sure, that each fatherid is there at most once, you could write DISTINCT ALL or EXCEPT ALL as an optimization. But with the code above, you don't have to worry. EXCEPT and DISTINCT (without the ALL) will handle duplicates just fine.

Hint: Set theory is very good for parallelization. Writing the query this way, your database can optimize the query by running the sub-queries in parallel or by using parallelized vector operations (e.g together with so called bitmap indexes).

A more advanced approach

If you have a situation where you cannot rely on simple set theory, you can use a more advanced approach using JOINS. Joins are a usually expensive. Hence such a approach needs some care regarding performance. Additionally joins interweave the partial queries in a way, so they can't be executed independently any more.

Semi-Join

To find all fathers having a son called Bill, you should use a so called semi-join (WHERE EXISTS). The semi join tells the database engine, that you are not interested in any particular properties of the son. You only want to know if the on exists or not. This usually allows the database engine to use some special optimizations, e.g. only using indexes and bypass the underlying table completely:

SELECT f.id FROM
  Fathers f 
WHERE 
  EXISTS (
    SELECT * 
    FROM Sons s 
  WHERE
    f.id = s.fatherid
    AND s.name = 'bill'
  )
;

Attention: An anti join is sometimes written as IN. Some database engines have problems, when the IN returns more than a few thousand results. Additionally the IN does not always unlock the optimizations reserved for a semi join.

Anti-Join

For the son called Suzy you can use a so called anti-join (WHERE NOT EXISTS). This works similar to the semi join, but inverses the result set. Again, you tell the database engine that you are only interested in the existence of certain rows of a table, unlocking some very special optimizations:

SELECT f.id FROM
  Fathers f 
WHERE 
  NOT EXISTS (
    SELECT * 
    FROM Sons s 
  WHERE
    f.id = s.fatherid
    AND s.name = 'suzy'
  )
;

Attention: You cant't replace an anti-join with a semi join when you write s.name <> 'suzy'. This would return all fathers that have any son whose name is not Suzy (e.g. with a son called Bill) and not the fathers who don't have any son called Suzy.

Attention: An anti join is sometimes written as NOT IN. Some database engines have problems, when the NOT IN returns more than a few thousand results. Additionally the NOT IN does not always unlock the optimizations reserved for an anti join.

Combining into one query

Together with the query that returns all fathers with two or more sons, you can combine the semi join and the anti join into one query:

SELECT f.id FROM
  Fathers f JOIN Sons s ON (f.id = s.fatherid)
WHERE 
  EXISTS (
    SELECT * 
    FROM Sons s2 
  WHERE
    f.id = s2.fatherid
    AND s2.name = 'bill'
  )
  AND
  NOT EXISTS (
    SELECT * 
    FROM Sons s3 
  WHERE
    f.id = s3.fatherid
    AND s3.name = 'suzy'
  )
GROUP BY f.id
HAVING COUNT(s.id) >= 2
;

Drawbacks: In this query you cannot run the subqueries independent of each other. Additionally joins are usually quite expensive. So you the query might run much slower than the pure set oriented approach.