
So I was assigned to estimate the performance of two queries and came to a surprising result. I had been told beforehand that HAVING is slower than WHERE because it only filters results after accessing rows. This seems reasonable, and this question on SQL clause execution order strengthened that impression.

However, I estimated the performance of the following queries under some assumptions, and it seems that with HAVING the execution is actually faster!

SELECT status, count(status)
FROM customer
GROUP BY status
HAVING status != 'Active' AND status != 'Dormant'

SELECT status, count(status)
FROM customer
WHERE status != 'Active' AND status != 'Dormant'
GROUP BY status

The assumptions were:

  • The table CUSTOMER has 100 000 records
  • The cost of accessing a row is 0.01ms (SELECT + COUNT)
  • The cost of executing a clause is 0.005ms
  • There are three types of customer statuses, the two above and 'Deceased'
  • There are 15 000 'Deceased' customers

Based on this my estimations were:

First query:
    Accessing all rows, FROM: 100 000 * 0.01ms = 1000ms
    GROUP BY: 100 000 * 0.005ms = 500ms
    HAVING (2 conditions, 3 groups): 2 * 3 * 0.005ms = 0.03ms
    SELECT and COUNT results: 15 000 * 0.01ms = 150ms
    Total execution time: 1.65003s

Second query:
    Accessing all the rows, FROM: 1000ms
    WHERE: 2 * 100 000 * 0.005ms = 1000ms
    GROUP BY: 15 000 * 0.005ms = 75ms
    SELECT and COUNT results: 15 000 * 0.01ms = 150ms
    Total execution time: 2.225s

The difference comes from the fact that GROUP BY produces only three groups, which are very cheap to filter, while WHERE has to go through the records and filter them one by one.

As I naïvely rely on authority, I'm assuming I either made a mistake somewhere or the assumptions provided are wrong.

So does GROUP BY with HAVING really behave like this, resulting in a reduced execution time?

EDIT: Query plans

PLAN_TABLE_OUTPUT /* With HAVING */

------------------------------------------------------------------------------------
| Id  | Operation                   | Name | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |      |     5 |    35 |     4  (25)| 00:00:01 |
|*  1 |  FILTER                     |      |       |       |            |          |
|   2 |   HASH GROUP BY             |      |     5 |    35 |     4  (25)| 00:00:01 |
|   3 |    TABLE ACCESS STORAGE FULL| CUSM |     5 |    35 |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("STATUS"<>'Active' AND "STATUS"<>'Dormant')


PLAN_TABLE_OUTPUT /* With WHERE */
-----------------------------------------------------------------------------------
| Id  | Operation                  | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |      |     1 |     7 |     4  (25)| 00:00:01 |
|   1 |  HASH GROUP BY             |      |     1 |     7 |     4  (25)| 00:00:01 |
|*  2 |   TABLE ACCESS STORAGE FULL| CUSM |     1 |     7 |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
2 - storage("STATUS"<>'Active' AND "STATUS"<>'Dormant')
    filter("STATUS"<>'Active' AND "STATUS"<>'Dormant')
    why are you estimating the times like that instead of testing which one is actually faster? – Lamak Apr 10 '18 at 16:17
  • @Lamak As I think I said, it was an assignment. That would be the best approach in the 'real world' I presume. – Felix Apr 10 '18 at 16:19
  • But there's just no point in estimating times like you are doing. And you are asking why the result is counterintuitive... and the answer is most likely that this way of estimating execution times is wrong (or too simplistic, I don't know) – Lamak Apr 10 '18 at 16:21
  • @Lamak Let me try to rephrase the question.. – Felix Apr 10 '18 at 16:21
  • Yeah I don't get what the issue is here: HAVING filters the set after roll-up while WHERE filters before. Why is it counterintuitive that HAVING should be faster? And what's the difference? Just use whichever one works better for you – Daniel Marcus Apr 10 '18 at 16:22
  • Do you have indexes on the 'status' column? If so, please add them to the question. – The Impaler Apr 10 '18 at 16:24
  • @TheImpaler No, I think not. – Felix Apr 10 '18 at 16:25
  • Do you get the same timings if you use count(*) instead of count(status)? – The Impaler Apr 10 '18 at 16:28
  • @TheImpaler I think I mentioned that we were just estimating the performance of queries based on assumptions, not benchmarking actual databases. Although I'm told using `COUNT` with an asterisk is slower than naming the counted column. – Felix Apr 10 '18 at 16:30
  • just as an aside, `COUNT(*)` is not the same as `COUNT(column)`, they can return different values – Lamak Apr 10 '18 at 16:33
  • @Lamak Yeah, I'm aware, but always good to clarify. – Felix Apr 10 '18 at 16:34
  • Well, to answer your question you'll need to take a look at the guts of Oracle. Otherwise, we are just speculating. I think you'll need to get the execution plan for each query and compare them (I can help reading them). Do: **EXPLAIN PLAN **; then **select plan_table_output from table(dbms_xplan.display('plan_table',null,'typical'))** – The Impaler Apr 10 '18 at 16:36
  • **EXPLAIN PLAN FOR ** -- my bad. – The Impaler Apr 10 '18 at 16:45
  • @TheImpaler There they are. We have a dummy version of the database with just a few entries. So that is shown there. – Felix Apr 10 '18 at 16:52

2 Answers


Here's the thing:

  1. According to the Oracle execution plan, both queries are performing a full table scan. That is, they are reading ALL THE ROWS of the table. No difference there.

  2. The HAVING query performs the GROUP BY (hashing) resulting in 3 rows. Then, it applies the filter to those 3 rows, and returns the result.

  3. The WHERE query applies the filter to each row (100,000 in the specification) as it is read, reducing them to 15,000. Finally it groups these (hashing) into a single row and returns it.

I think in the case you are describing, the WHERE query applies the filter to all 100,000 rows, while the HAVING query defers the filter and applies it to only 3 rows. Under your cost model, that makes the HAVING query faster.

Now, don't assume this result will apply to every query like this. Oracle is very clever about using table statistics. The plan will change in the future according to the real data you add to the table. A plan estimated over 5 rows is by no means representative of a plan for 100,000 rows.
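
For example, once realistic data is loaded, refreshing the statistics lets the optimizer re-evaluate its plan. A minimal sketch (the table name is taken from your question):

begin
    dbms_stats.gather_table_stats(user, 'CUSTOMER');
end;
/

On your dummy database the optimizer estimated only a handful of rows, which is why both plans show the same cost.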

Take this result with a grain of salt. Real-world scenarios are way trickier.

The Impaler

One of your assumptions is wrong. The rule should read: HAVING is slower than WHERE because it only filters results after accessing **and hashing** rows.

It's that hashing part that makes HAVING conditions more expensive than WHERE conditions. Hashing requires writing data, which can be more expensive both physically and algorithmically.

Theory

Hashing requires writing as well as reading data. Ideally hashing the data will run in O(n) time. But in practice there will be hash collisions, which slow things down. And in practice not all the data will fit in memory.

Those two problems can be disastrous. In the worst case, with limited memory, hashing requires multiple passes and the complexity approaches O(n^2). And writing to disk in the temporary tablespace is orders of magnitude slower than writing to memory.
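
If you want to check whether a HASH GROUP BY stayed in memory or spilled to the temporary tablespace, one way is to run the query with rowsource statistics and then look at the actual plan. A sketch (it assumes you have the privileges dbms_xplan.display_cursor needs on the V$ views):

select /*+ gather_plan_statistics */ status, count(status)
from customer
group by status
having status != 'Active' and status != 'Dormant';

select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));

With the ALLSTATS LAST format the HASH GROUP BY line gains memory columns (OMem, 1Mem, Used-Mem), and a Used-Tmp column shows up if the operation spilled to disk.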

Those are the kind of performance issues you need to worry about with databases. The constant time to run simple conditions and expressions is usually irrelevant compared to the time to read, write, and join the data.

That might be especially true in your environment. The operation TABLE ACCESS STORAGE FULL implies you are using Exadata. Depending on the platform you might be taking advantage of SQL in silicon. Those high-level conditions may translate perfectly to low-level instructions executed on storage devices. Which means your estimate of the cost of executing a clause may be several orders of magnitude too high.

Practice

Create a sample table with 100,000 rows:

create table customer(id number, status varchar2(100));

insert into customer
select
    level,
    case
        when level <= 15000 then 'Deceased'               -- 15,000 rows, as in the question
        when level between 15001 and 50001 then 'Active'  -- ~35,000 rows
        else 'Dormant'                                    -- ~50,000 rows
    end
from dual
connect by level <= 100000;

begin
    dbms_stats.gather_table_stats(user, 'customer');
end;
/

Running the code in a loop shows that the WHERE version is about twice as fast as the HAVING version.

--Run times (in seconds): 0.765, 0.78, 0.765
declare
    type string_nt is table of varchar2(100);
    type number_nt is table of number;
    v_status string_nt;
    v_count number_nt;
begin
    for i in 1 .. 100 loop
        SELECT status, count(status)
        bulk collect into v_status, v_count
        FROM customer
        GROUP BY status
        HAVING status != 'Active' AND status != 'Dormant';
    end loop;
end;
/

--Run times (in seconds): 0.39, 0.39, 0.39
declare
    type string_nt is table of varchar2(100);
    type number_nt is table of number;
    v_status string_nt;
    v_count number_nt;
begin
    for i in 1 .. 100 loop
        SELECT status, count(status)
        bulk collect into v_status, v_count
        FROM customer
        WHERE status != 'Active' AND status != 'Dormant'
        GROUP BY status;
    end loop;
end;
/
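
The run times shown in the comments above were captured crudely; here is a sketch of one way to measure them, assuming set serveroutput on so the output is visible (dbms_utility.get_time returns elapsed time in hundredths of a second):

declare
    type string_nt is table of varchar2(100);
    type number_nt is table of number;
    v_status string_nt;
    v_count number_nt;
    v_start number;
begin
    v_start := dbms_utility.get_time;
    for i in 1 .. 100 loop
        SELECT status, count(status)
        bulk collect into v_status, v_count
        FROM customer
        WHERE status != 'Active' AND status != 'Dormant'
        GROUP BY status;
    end loop;
    dbms_output.put_line((dbms_utility.get_time - v_start)/100 || ' seconds');
end;
/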
Jon Heller