19

Just a small question: when trying to get the single max value of a column in a table, which one is better?

SELECT MAX(id) FROM myTable WHERE (whatever)

or

SELECT TOP 1 id FROM myTable WHERE (whatever) ORDER BY id DESC

I'm using Microsoft SQL Server 2012

Sanushi Salgado
Ole Albers

4 Answers

34

There will be no difference as you can test yourself by inspecting the execution plans. If id is the clustered index, you should see an ordered clustered index scan; if it is not indexed, you'll still see either a table scan or a clustered index scan, but it won't be ordered in either case.
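
For reference, a minimal way to inspect the plans yourself (this assumes the table and column names from the question; the SHOWPLAN statements must be alone in their batch, hence the GO separators):

SET SHOWPLAN_TEXT ON;
GO
SELECT MAX(id) FROM myTable;
SELECT TOP (1) id FROM myTable ORDER BY id DESC;
GO
SET SHOWPLAN_TEXT OFF;
GO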

The TOP 1 approach can be useful if you want to pull other values along from the same row, which is easier than retrieving the max in a subquery and then joining back. In either case, if you want other values from the row, you need to decide how ties should be handled (see the sketch below).
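
For example (name and created are hypothetical columns, not from the question):

SELECT TOP (1) id, name, created
FROM myTable
ORDER BY id DESC;

-- The MAX equivalent needs a subquery, and returns all rows that tie for the max:
SELECT t.id, t.name, t.created
FROM myTable AS t
WHERE t.id = (SELECT MAX(id) FROM myTable);

-- TOP (1) WITH TIES matches that behavior by keeping every row tied on the ORDER BY value:
SELECT TOP (1) WITH TIES id, name, created
FROM myTable
ORDER BY id DESC;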

Having said that, there are some scenarios where the plan can be different, so it is important to test depending on whether the column is indexed and whether or not it is monotonically increasing. I created a simple table and inserted 50000 rows:

CREATE TABLE dbo.x
(
  a INT, b INT, c INT, d INT, 
  e DATETIME, f DATETIME, g DATETIME, h DATETIME
);
CREATE UNIQUE CLUSTERED INDEX a ON dbo.x(a);
CREATE INDEX b ON dbo.x(b);
CREATE INDEX e ON dbo.x(e);
CREATE INDEX f ON dbo.x(f);

INSERT dbo.x(a, b, c, d, e, f, g, h)
SELECT 
  n.rn, -- ints monotonically increasing
  n.a,  -- ints in random order
  n.rn, 
  n.a, 
  DATEADD(DAY, n.rn/100, '20100101'), -- dates monotonically increasing
  DATEADD(DAY, -n.a % 1000, '20120101'),     -- dates in random order
  DATEADD(DAY, n.rn/100, '20100101'),
  DATEADD(DAY, -n.a % 1000, '20120101')
FROM
(
  SELECT TOP (50000) 
     (ABS(s1.[object_id]) % 10000) + 1, 
     rn = ROW_NUMBER() OVER (ORDER BY s2.[object_id])
  FROM sys.all_objects AS s1 
  CROSS JOIN sys.all_objects AS s2
) AS n(a,rn);
GO

On my system this created values in a/c from 1 to 50000, b/d between 3 and 9994, e/g from 2010-01-01 through 2011-05-16, and f/h from 2009-04-28 through 2012-01-01.
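
If you want to sanity-check the generated ranges on your own system:

SELECT MIN(a) AS min_a, MAX(a) AS max_a,
       MIN(b) AS min_b, MAX(b) AS max_b,
       MIN(e) AS min_e, MAX(e) AS max_e,
       MIN(f) AS min_f, MAX(f) AS max_f
FROM dbo.x;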

First, let's compare the indexed monotonically increasing integer columns, a and c. a has a clustered index, c does not:

SELECT MAX(a) FROM dbo.x;
SELECT TOP (1) a FROM dbo.x ORDER BY a DESC;

SELECT MAX(c) FROM dbo.x;
SELECT TOP (1) c FROM dbo.x ORDER BY c DESC;

Results:

[screenshot: estimated plan costs for the four queries]

The big problem with the 4th query is that, unlike MAX, it requires a sort. Here is query 3 compared to query 4:

[screenshots: execution plans for queries 3 and 4]

This will be a common problem across all of these query variations: a MAX against an unindexed column will be able to piggy-back on the clustered index scan and perform a stream aggregate, while TOP 1 needs to perform a sort, which is going to be more expensive.

I ran the same tests against b/d, e/g, and f/h and saw exactly the same results.
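
If you don't have a plan-reading tool handy, a simple sketch to reproduce the comparison against the dbo.x table above:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT MAX(c) FROM dbo.x;                     -- stream aggregate over a clustered index scan
SELECT TOP (1) c FROM dbo.x ORDER BY c DESC;  -- clustered index scan plus a Top N Sort

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;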

So it seems to me that, in addition to producing more standards-compliant code, there is a potential performance benefit to using MAX over TOP 1, depending on the underlying table and indexes (which can change after your code is in production). So I would say that, without further information, MAX is preferable.

(And as I said before, TOP 1 might really be the behavior you're after if you're pulling additional columns. In that case, you'll want to test MAX + JOIN methods as well.)

Aaron Bertrand
  • +1 - But was any SQL used to calculate your percentages or did you interrogate the XML for the execution plan? Would be nice to have that in your answer so that future readers are also given knowledge on how to interrogate these plans themselves. – Wayne Jul 01 '12 at 21:50
  • 1
    Just displayed relevant output from SQL Sentry Plan Explorer, a free tool from my company. http://sqlsentry.net/ – Aaron Bertrand Jul 01 '12 at 22:13
  • [A good article on `Top N` sort here.](http://sqlblog.com/blogs/paul_white/archive/2010/08/27/sorting-row-goals-and-the-top-100-problem.aspx) if anyone's interested. It doesn't have to actually sort all the rows (just needs to keep track of the `TOP 1`) but does require a memory grant unlike the stream aggregate. – Martin Smith Jul 02 '12 at 06:23
  • Though I suspect the `TOP 1` sort operator is over costed in the plan. I tried with `TOP 1`, `TOP 100`, and `TOP 101` and all gave me the same estimated subtree cost despite the fact that the last one would need to sort all the rows. – Martin Smith Jul 02 '12 at 06:53
  • That really is a good and comprehensive answer. Would accept that twice if I could. Thanks – Ole Albers Jul 02 '12 at 11:48
6

The first one is certainly clearer in its intent.

There shouldn't be a significant performance difference, though, for this specific query (they should actually be pretty much identical, even though the result differs if there are no rows in myTable: MAX returns a single NULL, while TOP 1 returns no rows at all). Unless you have a valid reason to tune the query (e.g. a proven performance issue), always pick the one which shows the intent of the code.
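
A quick sketch of that edge case, using a throwaway temp table:

CREATE TABLE #empty (id INT);

SELECT MAX(id) FROM #empty;                      -- returns one row containing NULL
SELECT TOP (1) id FROM #empty ORDER BY id DESC;  -- returns zero rows

DROP TABLE #empty;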

Lucero
  • 3
    An additional advantage is that the first query is also DBMS-agnostic, meaning you could take the same query and run it on nearly any other DBMS and it would still work, whereas the second query is using the SQL-Server-specific `TOP` keyword which is only supported by SQL-Server. – Zane Bien Jul 01 '12 at 20:39
4

All query optimizers worth their salt should produce query plans with identical performance for both queries: if there is an index on the column being optimized, both queries should use it; if there is no index, both would produce a full table scan.
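
A quick sketch of the indexed case, using the question's table (IX_myTable_id is a hypothetical name, and this assumes id is not already indexed):

CREATE INDEX IX_myTable_id ON myTable(id);

-- Both of these can now be satisfied from the index alone:
SELECT MAX(id) FROM myTable;                      -- seeks to the highest key
SELECT TOP (1) id FROM myTable ORDER BY id DESC;  -- backward scan of the same index, stops after one row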

Sergey Kalinichenko
0

Though I suspect the TOP 1 sort operator is over costed in the plan. I tried with TOP 1, TOP 100, and TOP 101 and all gave me the same estimated subtree cost despite the fact that the last one would need to sort all the rows. – Martin Smith Jul 2 at 6:53

Whether you need 1 row or 100 rows, the optimizer has to do the same amount of work in this example: read all the rows from the table (clustered index scan), then sort all those rows (sort operation) since there is no index on column c, and finally return just the ones that are needed.
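
As a sketch of that, reusing the dbo.x table and the unindexed column c from Aaron's answer:

SELECT TOP (1) c FROM dbo.x ORDER BY c DESC OPTION (RECOMPILE);
SELECT TOP (100) c FROM dbo.x ORDER BY c DESC OPTION (RECOMPILE);
-- Both plans show the same clustered index scan plus Top N Sort,
-- with the same estimated subtree cost.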

SELECT TOP (1) b FROM dbo.x ORDER BY b DESC
OPTION (RECOMPILE);

SELECT TOP (100) b FROM dbo.x ORDER BY b DESC
OPTION (RECOMPILE);

Now try the two b queries above: here TOP 1 and TOP 100 show different costs because there is an index on column b. In this case you do not need to read all the rows and sort them; the work is just to navigate to the end of the index. For one row, read the last row on the last leaf page of the index. For 100 rows, find the last row on the last page and then do a backward scan until you have the 100 rows. One way to confirm this is sketched below.
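
A sketch to confirm the backward scan (this assumes the dbo.x table from Aaron's answer; SET STATISTICS XML returns the actual plan alongside the results):

SET STATISTICS XML ON;
SELECT TOP (1) b FROM dbo.x ORDER BY b DESC;  -- the IndexScan element should show ScanDirection="BACKWARD"
SET STATISTICS XML OFF;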

Gulli Meel
  • This is not correct. Please read [the link I gave which explains how the `TOP N` sort works](http://sqlblog.com/blogs/paul_white/archive/2010/08/27/sorting-row-goals-and-the-top-100-problem.aspx). 100 is a magic number between different methods but it doesn't look as though the costing in the plan takes account of this as it costs the same for `TOP 1` vs `TOP 50000` when running against Aaron's demo data. For `TOP 1` it essentially just needs to keep track of the maximum value which is the same as `MAX` needs to do though it is implemented differently. It doesn't need to sort all 50,000 rows. – Martin Smith Jul 14 '12 at 11:35
Without sorting 50,000 rows, how would you know which is the max value if the list is not sorted? There is no index on column C. – Gulli Meel Jul 14 '12 at 14:05
  • 1
    By scanning all of them and comparing each with the `TOP 1` value you have seen so far. There is no need to sort the entire set of 50,000 rows into order. – Martin Smith Jul 14 '12 at 14:12
Without sorting 50,000 rows, how would you know which is the max value if the list is not sorted? There is no index on column C. If you look at Paul's example, he mentions that the cost and plan will be identical. The only issue is that in the case of TOP 101 the query will spill to tempdb. This slows down TOP 101, but the cost remains the same. He mentioned that this happens for his sample data, where he has a char(1000) column, but Aaron's sample data is not the same as Paul's. I will look at Paul's blog, test with both his data and Aaron's data, and let you know the details. – Gulli Meel Jul 14 '12 at 14:50
  • 1
    There is no need to sort all the rows from position 2 to 50,000 into order. We don't care about those. `TOP 1` does not do this but `TOP 101` does. This is clear in the post I linked. However the estimated subtree cost does not show any difference in costs despite the fact that the `TOP 1` is doing significantly less work hence my original statement. – Martin Smith Jul 14 '12 at 14:53
That is the same thing, right, except you keep at most 100 rows in your worktable instead of all the rows (which will be the case when you have more than 100 rows). Thus the only difference is that in the case of TOP 101 your worktable holds all the rows and, depending on the amount of data, it eats up a lot of memory and space in tempdb. But for TOP 100 the worktable holds at most 100 rows, so it can easily fit in memory: a smaller memory requirement and no spill to tempdb, reducing the I/O etc. But your main point was cost, which is still the same whether you do 1, 100, or 101. The COST will be the same. – Gulli Meel Jul 14 '12 at 14:56
  • 2
    Cost will **not** be the same. Scanning all the rows once and remembering the highest you have seen so far has `O(n)` complexity. Sorting all the rows has `O(n log n)` complexity. – Martin Smith Jul 14 '12 at 14:58
The cost is not related to the final outcome; it is related to the estimation. Both produce identical query plans, which use a range seek to find the rows and a Top N Sort to perform the sort. Despite this, we find that the two methods have very different performance. – Gulli Meel Jul 14 '12 at 14:59
The above comment could not be edited. He was pointing out that while executing it uses different algorithms, but during plan estimation it is the same, and thus the same cost. He mentions that despite the same cost, TOP 101 performs badly compared to TOP 100. This difference is based on the algorithm chosen. Also, he mentioned that it was for his sample data; for smaller data it is possible that the worktable size never exceeds the minimum memory grant, and thus there is no issue. – Gulli Meel Jul 14 '12 at 15:03
  • So you are agreeing with my original point now then? The `TOP 1` is over costed in the plan as SQL Server costs it the same as though it needed to sort all the rows when in fact it doesn't. – Martin Smith Jul 14 '12 at 15:06
It has to sort all the rows, but it keeps the top 100 in the worktable rather than all the rows processed so far. Thus the costing is the same. In the case of TOP 100 it reads the first 100 rows, sorts them, and keeps them in the worktable. Then it reads the next row, and if it is bigger than any of the 100 rows it enters the worktable, replacing the 100th sorted row. When row 1001 comes, it is compared against the 100 rows and the appropriate action is taken. – Gulli Meel Jul 14 '12 at 15:08
In the case of TOP 101 it keeps all the rows in the worktable, so when row 1001 comes it is compared against all 1000 existing rows; thus it is a costly operation and it has to keep much more data. For the last row it has to compare against almost all the rows, which is very costly. But during estimation the optimizer doesn't know which algorithm it will pick, and thus estimates the same. – Gulli Meel Jul 14 '12 at 15:15