2

Expression rand(checksum(newid())) is often used to generate random numbers.

While generating some test data, I executed following statement:

select rand(checksum(newid())) R1, rand(checksum(newid())) R2
from ftSequence(3)

where ftSequence(N) is a table function returning single column N and values 1, 2, 3 ... N in its rows (as much as argument N is). Running this resulted to quite expected data:

R1                     R2
---------------------- ----------------------
0,817                  0,9515
0,3043                 0,3947
0,5336                 0,7963

Then it was necessary to find sum in each column, and I did:

select sum(rand(checksum(newid()))) S1, sum(rand(checksum(newid()))) S2
from ftSequence(3)

Surprisingly I got the same number in each column:

S1                     S2
---------------------- ----------------------
1,2276                 1,2276

Why it happens? The same behavior for avg, min and max aggregate functions. Is it query optimizer, or do I miss some logic?


More observations after comments.

Placing sum(rand(checksum(newid()))) into CTE or subquery like

select
    (select sum(rand(checksum(newid()))) from ftSequence(3)) S1,
    (select sum(rand(checksum(newid()))) from ftSequence(3)) S2

or

select sum(R1) S1, sum(R2) S2
from (
    select rand(checksum(newid())) R1, rand(checksum(newid())) R2
    from ftSequence(3)
) R

as well as doing a trick like

select
    sum(rand(checksum(newid()))) S1
    , sum(rand(checksum(newid())) + 0) S2
from ftSequence(3)

worked, resulting to different values

S1                     S2                    
---------------------- ----------------------
0,7349                 1,478                 

Happy of that, and needed to produce more than one row of several different avg(rand(checksum(newid()))) from ftSequence(3), I did following

select R.*
from ftSequence(3) S1
    cross join (
        select
            avg(rand(checksum(newid()))) R1,
            avg(rand(checksum(newid())) + 0) R2
        from ftSequence(3)
    ) R

and got following result:

R1                     R2
---------------------- ----------------------
0,6464                 0,4501
0,6464                 0,4501
0,6464                 0,4501

At this point I was not able to answer myself, whether is it correct result, or should values be all random? What are the ways to make all values random?

i-one
  • 5,050
  • 1
  • 28
  • 40
  • 1
    Interesting, here's another test that does not rely on your sequence function: --expected result, different sums SELECT SUM(R1) 'R1 Sum', SUM(R2) 'R2 Sum' FROM ( select rand(checksum(newid())) R1, rand(checksum(newid())) R2, name from sys.objects ) v --suprise result, same sums select sum(rand(checksum(newid()))) S1, sum(rand(checksum(newid()))) S2 from sys.objects – Steve D Jul 03 '13 at 23:42
  • 1
    Interesting Question: Easy Answer ["The optimizer does not guarantee timing or number of executions of scalar functions"](http://connect.microsoft.com/SQLServer/feedback/details/350485/bug-with-newid-and-table-expressions) and in this case it evaluates it once. Whether there is any reliable way of suppressing this and having it re-evaluated I'm not sure. – Martin Smith Jul 03 '13 at 23:43

1 Answers1

1

As I stated in the question, I was need a set of random test data, but not rand() which is uniformly distributed, I was need a set of

select avg(rand(checksum(newid()))) from ftSequence(@n)

that converges to gaussian distribution.

I found, that instead of cross join I can use cross apply statement with additional meaningless check of the outer scope data:

declare @rCnt int, @n int
set @rCnt = 5000000
set @n = 5

select R.*
from ftSequence(@rCnt) S
    cross apply (
        select
            avg(rand(checksum(newid())) + 1e-101) R1,
            avg(rand(checksum(newid())) + 1e-102) R2,
            avg(rand(checksum(newid())) + 1e-103) R3
        from ftSequence(@n)
        where S.N is not NULL
    ) R

However, I'm not sure it can be considered as a reliable approach.

Following could be more reliable alternative:

declare @rCnt int, @n int
set @rCnt = 5000000
set @n = 5

create table #Rand (ValNo int, R1 float, R2 float, R3 float)
create clustered index #IX_Rand on #Rand (ValNo)

insert into #Rand
select
    S.N / @n,
    rand(checksum(newid())) R1,
    rand(checksum(newid())) R2,
    rand(checksum(newid())) R3
from ftSequence(@n * @rCnt) S

select AVG(R.R1), AVG(R.R2), AVG(R.R3)
from #Rand R
group by ValNo
i-one
  • 5,050
  • 1
  • 28
  • 40
  • 1
    I'm not sure that `rand(checksum(newid())` converges to a Gaussian distribution does it? You could use the [Box–Muller transform](http://en.wikipedia.org/wiki/Box_Muller_transform) `SELECT SQRT(-2 * LOG(ABS(CAST(CAST(CRYPT_GEN_RANDOM(8) AS BIGINT) AS FLOAT) / 9223372036854775807))) * COS(2 * PI() * ABS(CAST(CAST(CRYPT_GEN_RANDOM(8) AS BIGINT) AS FLOAT) / 9223372036854775807))` – Martin Smith Jul 06 '13 at 17:25
  • Afaik, `rand(checksum(newid()))` is uniform, but `avg()` of a sequence of them converges to Gaussian (according to Central Limit Theorem). Of course, the less values averaged, the worse approximation is, but it is acceptable in my case. – i-one Jul 06 '13 at 17:46