-2

I'm looking for the official T-SQL documentation for "ORDER BY RAND()" and "ORDER BY NEWID()". There are numerous articles describing them, so they must be documented somewhere.

I'm looking for a link to an official SQL Server documentation page like this: http://technet.microsoft.com/en-us/library/ms188385.aspx

CLARIFICATION:

What I'm looking for is the documentation for "order_by_expression" that explains the difference in behavior between a nonnegative integer constant, a function that returns a nonnegative integer, and a function that returns any other value (like RAND() or NEWID()).


ANSWER:

I appologize for the lack of clarity in my original question. As with most programming-related problems, the solution to the problem is primarily figuring out what question you're actually trying to answer.

Thank you everyone.


The answer is in this document: From: http://www.wiscorp.com/sql200n.zip

Information technology — Database languages — SQL — Part 2: Foundation (SQL/Foundation)

22.2 <direct select statement: multiple rows> includes a <cursor specification>.

At this point we have the first half of the answer:

A SELECT statment is a type of CURSOR, which means that operations can be performed iteratively on each row. Although I haven't found a statement in the docs that explicity says it, I'm content to assume that the expression in the order_by_expression will be executed for each row.

Now it makes sense what is happening when you use RAND() or NEWID() or CEILING(RAND() + .5) / 2 as opposed to a numeric constant or a column name.
The expression will never be treated like a column number. It will always be a value that is generated for each row which will be used as the basis for determining the order of the rows.

However, for thoroughness, let's continue to the full definition of what an expression can be.

14.3 <cursor specification> includes ORDER BY <sort specification list>.

10.10 <sort specification list> defines:

<sort specification> ::= <sort key> [ <ordering specification> ] [ <null ordering> ]
    <sort key> ::= <value expression>
    <ordering specification> ::= ASC | DESC
    <null ordering> ::= NULLS FIRST | NULLS LAST

Which takes us to:

6.25 <value expression>

Where we find the second half of the answer:

<value expression> ::= 
      <common value expression> 
    | <boolean value expression> 
    | <row value expression>

<common value expression> ::= 
      <numeric value expression> 
    | <string value expression>
    | <datetime value expression>
    | <interval value expression>
    | <user-defined type value expression>
    | <reference value expression>
    | <collection value expression>

    <user-defined type value expression> ::= <value expression primary>
    <reference value expression> ::= <value expression primary>
    <collection value expression> ::= <array value expression> | <multiset value expression>

From here we descend into the numerous possibile types of expressions that can be used.

NEWID() returns a uniqueidentifier.
It seems reasonable to assume that uniqueidentifiers are compared numerically, so if expression is NEWID() our <common value expression> will be a <numeric value expression>.

Similarly, RAND() returns a numeric value, and it will also be evaluated as a <numeric value expression>.

So, although I wasn't able to find anything in Microsoft's offical documentation that explains what ORDER BY does when called using an order_by_expression that is an expression, it really is documented, as I knew it must be.

Anechoic
  • 15
  • 3
  • 1
    [RAND](http://technet.microsoft.com/en-us/library/ms177610.aspx), [NEWID](http://technet.microsoft.com/en-us/library/ms190348.aspx), [ORDER BY](http://technet.microsoft.com/en-us/library/ms188385.aspx) – asawyer Jan 31 '14 at 18:39
  • 1
    Why do you need an official documentation when you can simply test that it is allowed? – ypercubeᵀᴹ Feb 06 '14 at 19:10
  • What exactly are you after, specific links to the `ORDER BY`, `RAND()`, `NEWID()` and `` documentation? – ypercubeᵀᴹ Feb 06 '14 at 19:16
  • The order_by_expression _"Specifies a column or *expression* on which to sort the query result set."_ It does not say what an "expression" does. If the expression is a nonnegative integer, it is (supposedly) a column number. If it is an expression like 'RAND()' it is some kind of index. Given an expression of 'CEILING(RAND() + .5) / 2' (which returns either 0.5 or 1), is it a column number or an index? That could do two completely different things based on the value that is returned by the expression. That's a _bizarre_ design. That's why I'm wanting to see it documented. – Anechoic Feb 07 '14 at 20:13
  • Since my question has been interpreted too literally, I have updated it to be more exacting. – Anechoic Feb 11 '14 at 00:44
  • I have found what I was looking for. I've posted a link to the ANSWER section that I have added to my original post. – Anechoic Feb 12 '14 at 23:40
  • 5
    This is bizarre to me. For one, you wanted official documentation from Microsoft, and the "answer" you've for some reason added to the question cites some document from some domain name from a company I've never heard of and which I'm not even tempted to download. Next, I'm not sure what you've identified in that doc that hasn't already been explained to you. Finally, your explanation ignores the fact that different expressions behave differently (rand() is not evaluated per row, for example). – Aaron Bertrand Feb 13 '14 at 00:41
  • 1
    @Anechoic The newly edited part that you added recently looks like an answer and not like a question. So, please remove it from here and add it as an **answer**. – ypercubeᵀᴹ Feb 13 '14 at 09:47
  • 1
    I (personally) do not think that it answers the original question you posed but it is an answer, it should not be part of the question. – ypercubeᵀᴹ Feb 13 '14 at 09:48

3 Answers3

8

If you're trying to determine why these behave differently, the reason is simple: one is evaluated once, and treated as a runtime constant (RAND()), while the other is evaluated for every single row (NEWID()). Observe this simple example:

SELECT TOP (5) RAND(), NEWID() FROM sys.objects;

Results:

0.240705716465209        8D5D2B55-E5DE-4FF9-BA84-BC82F37B8F3A
0.240705716465209        C4CBF1CA-E6D0-4076-B6A6-5048EA612048
0.240705716465209        9BFAE5BB-B5B9-47DE-B8F9-77AAEFA5F9DB
0.240705716465209        89FFD8A1-AC73-4CEB-A5C0-00A76D040382
0.240705716465209        BCC89923-735E-43B3-9ECA-622A8C98AD7D

Now, if you apply an order by to the left column, SQL Server says, ok, but every single value is the same, so I'm basically just to ignore your request and move on to the next ORDER BY column. If there isn't one, then SQL Server will default to returning the rows in whatever order it deems most efficient.

If you apply an order by to the right column, now SQL Server actually has to sort all of the values. This introduces a Sort (or a TopN Sort if TOP is used) operator into the plan, and is likely going to take more CPU (though overall duration may not be substantially affected, depending on the size of the set and other factors).

Let's compare the plans for these two queries:

SELECT RAND() FROM sys.all_columns ORDER BY RAND();

The plan:

enter image description here

There is no sort operator going on, and both of the scans are Ordered = False - this means that SQL Server has not decided to explicitly implement any ordering, but this certainly does not mean that the order will be any different on each execution - it just means that the order is non-deterministic (unless you add a secondary ORDER BY - but even in that case, the RAND() ordering is still ignored because, well, it's the same value on every row).

And now NEWID():

SELECT NEWID() FROM sys.all_columns ORDER BY NEWID();

The plan:

enter image description here

There is a new Sort operator there, which means that SQL Server must reorder all the rows to be returned in the order of the generated GUID values on each row. The scans of course are still unordered, but the Sort ultimately applies the order.

I don't know that this specific implementation detail is officially documented anywhere, though I did find this article which includes an explicit ORDER BY NEWID(). I doubt you'll find anything official that documents ORDER BY RAND() in any way, because that just doesn't make any sense to do, officially supported or not.

Re: the comment that SQL Server assigns a seed value at random - this should not be interpreted as a seed value **per row** at random. Demonstration:

SELECT MAX(r), MIN(r) FROM 
(
  SELECT RAND() FROM sys.all_columns AS s1 
  CROSS JOIN sys.all_columns AS s2
) AS x(r);

Results:

0.4866202638872        0.4866202638872

On my machine, this took about 15 seconds to run, and the results were always the same for both MIN and MAX. Keep increasing the number of rows returned and the amount of time it takes, and I guarantee you will continue to see the exact same value for RAND() on every row. It is calculated exactly once, and that is not because SQL Server is wise to the fact that I am not returning all of the rows. This also yielded the same result (and it took just under 2 minutes to populate the entire table with 72 million rows):

SELECT RAND() AS r INTO #x 
      FROM sys.all_columns AS s1 
CROSS JOIN sys.all_columns AS s2
CROSS JOIN sys.all_columns AS s3;

SELECT MAX(r), MIN(r) FROM #x;

(In fact the SELECT took almost as long as the initial population. Do not try this on a single-core laptop with 4GB of RAM.)

The result:

0.302690214345828        0.302690214345828
Aaron Bertrand
  • 272,866
  • 37
  • 466
  • 490
  • Is it that the RAND is only evaluated once, or is it the time seed value did not increment inside that execution time? – asawyer Jan 31 '14 at 18:56
  • `Repetitive calls of RAND() with the same seed value return the same results.` and ` If seed is not specified, the SQL Server Database Engine assigns a seed value at random` from [here](http://technet.microsoft.com/en-us/library/ms177610.aspx) would seem to indicate that the value is not static unless you provide a seed. – asawyer Jan 31 '14 at 19:00
  • 1
    @asawyer you really should read that as `a seed`, not `a seed per row`. – Aaron Bertrand Jan 31 '14 at 19:03
  • Ah, ok I see what you mean. Thanks for clarifying that for me! – asawyer Jan 31 '14 at 19:06
  • These answers are useful, but they don't answer my question (see the original post). – Anechoic Feb 06 '14 at 19:02
  • 1
    @Anechoic Sorry, but these are about as close as you're going to get. [They don't officially list those expressions explicitly, they just say `ORDER BY order_by_expression`](http://technet.microsoft.com/en-us/library/ms188385.aspx). When you look at the documentation for a data type like `INT`, do you need them to explicitly list 42 as a possible value in order for you to believe that you can store 42 in an `INT`? – Aaron Bertrand Feb 06 '14 at 19:06
  • The documentation for **INT** tells you what is valid. That **42** is valid is, therefore, documented. The official documentation of the order_by_expression **precludes** the use of **RAND()** and **NEWID()** as neither one results in a integer that can be used to identify a column. They made it into the product, so they must be documented somewhere. Otherwise, how did anyone come to know about them? In "ORDER BY 1", "1" is a column number. In "ORDER BY RAND()", "RAND()" is a magic index. That makes no sense at all. – Anechoic Feb 06 '14 at 20:44
  • @Anechoic I don't know what to tell you man, it's still not explicitly documented. If you have a problem with that, please take it up with Microsoft. I didn't write the documentation. – Aaron Bertrand Feb 06 '14 at 21:00
2

Check out the links below.

ORDER BY, RAND and NEWID are statement and functions part of the TSQL language.

Combining them to randomly select or generate data is a design pattern.

See the first two articles.

Generate random integers without collisions

http://www.sqlperformance.com/2013/09/t-sql-queries/random-collisions

MSDN - Selecting Rows Randomly from a Large Table

http://msdn.microsoft.com/en-us/library/cc441928.aspx

MSDN - RAND

http://technet.microsoft.com/en-us/library/ms177610.aspx

MSDN - NEWID

http://msdn.microsoft.com/fr-fr/library/ms190348.aspx

MSDN - ORDER BY

http://technet.microsoft.com/en-us/library/ms188385.aspx

Very good read Aaron.

But again, taken separately (RAND, NEWID, ORDER BY) are elements part of a TSQL language.

Using them to randomly choose data is a design pattern.

Also, you can call RAND() in a while loop - RBAR() produce random numbers.

That is because in the query plan, RAND(), is no longer a constant.

-- RBAR solution
declare @x float = 0;
declare @y int = 0;
while (@y < 100)
begin
    set @x = rand();
    print @x;
    set @y += 1;
end;
go

enter image description here

CRAFTY DBA
  • 14,351
  • 4
  • 26
  • 30
  • These answers are useful, but they don't answer my question (see the original post). – Anechoic Feb 06 '14 at 19:05
  • I have found what I was looking for. I've posted a link to the ANSWER section that I have added to my original post. – Anechoic Feb 12 '14 at 23:38
2

If we're being a stickler for details, the question you asked was essentially "Where's the docs for ~". The answer is nowhere, there is no doc like the one you're looking for.

Not a single one anyway, there are multiple docs that treat NEWID(), RAND() and ORDER BY separately and you have to put the pieces together yourself.

Basically,

This lets you know it's valid syntax, but there's no single link for you to point to.

Michael J Swart
  • 3,060
  • 3
  • 29
  • 46
  • What I thought was a clear, focused question clearly wasn't, so I've revised my previous clarification to be more precise. Here is the documentation for ORDER BY: [http://technet.microsoft.com/en-us/library/ms188385.aspx] It explains what happens when you use a nonnegative integer or a list of string literals to specify column names. It does not explain what happens if you use an *expression*. Where is the documentation for what an *expression* actually does? – Anechoic Feb 11 '14 at 16:53
  • I understand what you mean now. But other than it's an expression "on which to sort the query result set". You're not going to find what you're looking for. – Michael J Swart Feb 11 '14 at 17:40
  • 1
    So the answer to "where are the docs that say X" the answer is still "nowhere". – Michael J Swart Feb 11 '14 at 17:57
  • I'm completely stumped by this. The docs don't explain what it means to sort the query result set on an expression and everybody else seems pretty okay with that. So, tell me, if I use an expression like 'CEILING(RAND() + .5) / 2' (which returns either 0.5 or 1) what will happen? How did you come to know the answer (other than by experimentation)? How is anyone to know how to use it correctly if it's not defined somewhere? – Anechoic Feb 11 '14 at 20:38
  • 1
    I think there's a distinction that an integer is different than an expression that evaluates to an integer. It's not meaningful to sort by a constant which is why no one cares. It's also what lets T-SQL support integers that indicate a column. Again, no-one's going to need to sort by CAST(0.5 * 2.0 as int) because it wouldn't matter. Also, I'm done. – Michael J Swart Feb 12 '14 at 15:15
  • I have found what I was looking for. I've posted a link to the ANSWER section that I have added to my original post. – Anechoic Feb 12 '14 at 23:38