35

I am looking for clarification on this. I am writing two queries below:

We have a table of employee name with columns ID , name , salary

  1.  Select name from employee 
    where sum(salary) > 1000 ;

  2.  Select name from employee 
    where substring_index(name,' ',1) = 'nishant' ;

Query 1 doesn't work but Query 2 does work. From my development experience, I feel the possible explanation to this is:

The sum() works on a set of values specified in the argument. Here 'salary' column is passed , so it must add up all the values of this column. But inside where clause, the records are checked one by one , like first record 1 is checked for the test and so on. Thus sum(salary) will not be computed as it needs access to all the column values and then only it will return a value.

Query 2 works as substring_index() works on a single value and hence here it works on the value supplied to it.

Can you please validate my understanding.

Cœur
  • 37,241
  • 25
  • 195
  • 267
Nishant_Singh
  • 748
  • 1
  • 6
  • 16
  • 2
    What do you think you could achieve with query 1? What do you think it does/should do (in English)? So you want the name of each employee with the "sum of salary" higher than 1000? But each employee has only one salary, so what do you mean with the "sum of salary" ? – Erwin Bolwidt Feb 26 '17 at 16:16
  • @ErwinBolwidt if in place of sum() , i am using avg() , say i want employees whose salary is more than the avg(). I wanted to just validate my explanation why i cannot use an aggregate function in a where clause . – Nishant_Singh Feb 26 '17 at 16:21
  • @user3527008 - You can use them in `having` clause – Gurwinder Singh Feb 26 '17 at 16:22
  • @GurV the question is why not in where and i have written an answer to that in the question . Is that correct sir ? – Nishant_Singh Feb 26 '17 at 16:23
  • [Aggregate functions can appear in select lists and in ORDER BY and HAVING clauses.](https://docs.oracle.com/database/121/SQLRF/functions003.htm#SQLRF20035). You've said in the question why it wouldn't make sense in the`where` clause, so I'm still not sure what you're really asking. You seem to just want a Yes/No response to whether what you said is right? – Alex Poole Feb 26 '17 at 16:47
  • 1
    @AlexPoole - I read the question to mean, "Why did the designers of SQL need to restrict the use of aggregate functions, so they can only be in a `HAVING` clause and not in `WHERE`? Is there a logical reason why they **can't** be in the `WHERE` clause?" –  Feb 26 '17 at 16:56
  • @AlexPoole I want ? I was looking for more clarity if someone can add something to it if it is a yes , but its a No , so i gave an explanation which i feel is correct . Well i have got my answer . Thanks :) – Nishant_Singh Feb 26 '17 at 16:56
  • @mathguy yes thats what i was asking. I am new to sql and was trying to understand this . – Nishant_Singh Feb 26 '17 at 17:39
  • 4
    Of course, a better question would be "why can't have a `where` clause like `where sal = max(sal)` or `where sal > avg(sal)` (even without a `group by` clause - put all the rows in one group). The reason, as I explained, is that this would require circular reasoning, because `max` and `avg` are not applied to all the rows in the base table; they are applied only to the rows that satisfy the `where` conditions! If you need `max` or `avg` for the entire base table, you must compute that separately in a subquery. –  Feb 26 '17 at 17:48
  • @mathguy Also, I think it's mostly due to wanting to achieve brevity in code, plus some laziness (who don't want to be lazy? ). If SQL is Linq-like, we could achieve some brevity with this syntax: `select * from employees e where e.sal > avg(employees.sal)`, but unlike Linq, aliases are not imposed on SQL, there could be ambiguity if the `from employees` is not aliased to `e`. What `avg(employees.sal)` mean if there is no alias? Does it come from after the where is applied or before it's applied? – Michael Buen Apr 19 '19 at 09:27
  • @mathguy To some effect, `where e.sal > avg(employees.sal)` can be achieved in C# using Linq. `where e.Sal > employees.Average(x => x.Sal)` https://dotnetfiddle.net/emCx2l – Michael Buen Apr 19 '19 at 09:44
  • @MichaelBuen - the same can be achieved in SQL, but it requires a subquery. The question was specifically about using aggregate functions in the same query that does the aggregation. That is not possible for LOGICAL reasons, so there's no way C# and Linq, or anything/anyone else, can do anything of the sort. Whatever Linq does must be equivalent to computing the average first, before applying the filter. –  Apr 19 '19 at 14:15
  • @mathguy Yes indeed, it compute the average first before applying the filter. It comes close to the brevity that is desired without incurring any ambiguities. RDBMS can somewhat achieve the same, if aliases are imposed it can disambiguate where the aggregation should be applied. OOP syntax in RDBMS would be nice too: `where e.Sal > employees.AVG(Sal)`. With that, there would be really no more confusion where the AVG should be applied. This would be a syntax error in RDBMS if it gained OOP syntax: `where e.Sal > e.AVG(Sal)`, since `e` won't have `AVG` method as it's just one row – Michael Buen Apr 19 '19 at 14:48
  • If RDBMS is designed with the same syntax as Linq, it will be able detect that where `e.Sal > e.AVG(Sal)` is a syntax error, similar to this: https://dotnetfiddle.net/r6IuFo This one is not a syntax error: `where e.Sal > employees.AVG(Sal)` https://dotnetfiddle.net/emCx2l – Michael Buen Apr 19 '19 at 14:54
  • Programmers are naturally lazy. Linq users can avoid the ritual of subquery syntax `where salary > (select avg(salary) from employees)` and achieve some brevity of code, as functionalities are just a method away. http://www.anicehumble.com/2019/04/salary-greater-than-average-salary-conundrum.html – Michael Buen Apr 19 '19 at 15:08

4 Answers4

100

The reason you can't use SUM() in the WHERE clause is the order of evaluation of clauses.

FROM tells you where to read rows from. Right as rows are read from disk to memory, they are checked for the WHERE conditions. (Actually in many cases rows that fail the WHERE clause will not even be read from disk. "Conditions" are formally known as predicates and some predicates are used - by the query execution engine - to decide which rows are read from the base tables. These are called access predicates.) As you can see, the WHERE clause is applied to each row as it is presented to the engine.

On the other hand, aggregation is done only after all rows (that verify all the predicates) have been read.

Think about this: SUM() applies ONLY to the rows that satisfy the WHERE conditions. If you put SUM() in the WHERE clause, you are asking for circular logic. Does a new row pass the WHERE clause? How would I know? If it will pass, then I must include it in the SUM, but if not, it should not be included in the SUM. So how do I even evaluate the SUM condition?

  • 1
    If you're looking for the solution, chances are all you need to do is move your `WHERE` statement into a `HAVING` clause. – deed02392 Jul 01 '20 at 11:14
  • 1
    @deed02392 - if you jump in three and a half years after the fact, perhaps you should read the question untill you understand it first. The OP wanted to understand the theoretical reason why the condition **must** be in a `HAVING` clause and not in the `WHERE` clause. The solution (same as you propose) is easy to give and known to all. The more subtle question is **why** it can only be done that way. You may also want to read the comments under the OP's question, where there was further clarification of the purpose for this thread. –  Jul 11 '20 at 20:55
33

Why can't we use aggregate function in where clause

Aggregate functions work on sets of data. A WHERE clause doesn't have access to entire set, but only to the row that it is currently working on.

You can of course use HAVING clause:

select name from employee 
group by name having sum(salary) > 1000;

If you must use WHERE, you can use a subquery:

select name from (
    select name, sum(salary) total_salary from employee
    group by name
) t where total_salary > 1000;
Gurwinder Singh
  • 38,557
  • 6
  • 51
  • 76
6

sum() is an aggregation function. In general, you would expect it to work with group by. Hence, your first query is missing a group by. In a group by query, having is used for filtering after the aggregation:

Select name
from employee 
group by name
having sum(salary) > 1000 ;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
0

Using having works since the query goes direct to the rows in that column while where fails since the query keep looping back and forth whenever conditions is not met.

Paul Kiarie
  • 37
  • 1
  • 8