So I was assigned to estimate the performance of two queries and came to a surprising result. I was told beforehand that HAVING
is slower than WHERE
because it only filters results after accessing rows. This seems reasonable, and this question on SQL clause execution order strenghtened that.
However, I estimated the performance of the following queries with some assumptions and it seems using HAVING
the execution is actually faster!
SELECT status, count(status)
FROM customer
GROUP BY status
HAVING status != 'Active' AND status != 'Dormant'
SELECT status, count(status)
FROM customer
WHERE status != 'Active' AND status != 'Dormant'
GROUP BY status
The assumptions were:
- The table
CUSTOMER
has 100 000 records - The cost of accessing a row is 0.01ms (SELECT + COUNT)
- The cost of executing a clause is 0.005ms
- There are three types of customer statuses, the two above and 'Deceased'
- There are 15 000 'Deceased' customers
Based on this my estimations were:
First query:
Accessing all rows, FROM: 100 000 * 0.01ms = 1000ms
GROUP BY: 100 000 * 0.005ms = 500ms
HAVING (2 conditions, 3 groups): 2 * 3 * 0.005ms = 0.03ms
SELECT and COUNT results: 15 000 * 0.01ms = 150ms
Total execution time: 1.65003s
Second query:
Accessing all the rows, FROM: 1000ms
WHERE: 2 * 100 000 * 0.005ms = 1000ms
GROUP BY: 15 000 * 0.005ms = 75ms
SELECT and COUNT results: 15 000 * 0.01ms = 150ms
Total execution time: 2.225s
The result came from the fact that GROUP BY
produces only three groups, which are very easy to filter, while WHERE
has to go through and filter the records one by one.
As I naïvely rely on authority, I'm assuming I either made a mistake somewhere or the assumptions provided are wrong.
So does GROUP BY
behave like this with HAVING
resulting in a reduced execution time?
EDIT: Query plans
PLAN_TABLE_OUTPUT /* With HAVING */
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 5 | 35 | 4 (25)| 00:00:01 |
|* 1 | FILTER | | | | | |
| 2 | HASH GROUP BY | | 5 | 35 | 4 (25)| 00:00:01 |
| 3 | TABLE ACCESS STORAGE FULL| CUSM | 5 | 35 | 3 (0)| 00:00:01 |
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("STATUS"<>'Active' AND "STATUS"<>'Dormant')
PLAN_TABLE_OUTPUT /* With WHERE */
-----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 7 | 4 (25)| 00:00:01 |
| 1 | HASH GROUP BY | | 1 | 7 | 4 (25)| 00:00:01 |
|* 2 | TABLE ACCESS STORAGE FULL| CUSM | 1 | 7 | 3 (0)| 00:00:01 |
-----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - storage("STATUS"<>'Active' AND "STATUS"<>'Dormant')
filter("STATUS"<>'Active' AND "STATUS"<>'Dormant')