0

Suppose I want to generate two random variables X and Y which are uncorrelated and uniformly distributed in [0,1].

A very naive way to generate them is the following code, which calls the random function twice:

import random

xT = 0
yT = 0
xyT = 0
for i in range(20000):
    x = random.random()
    y = random.random()
    xT += x
    yT += y
    xyT += x * y

# sample covariance estimate: E[XY] - E[X]E[Y]
print(xyT/20000 - (xT/20000) * (yT/20000))

However, these random numbers are really pseudo-random numbers generated by a formula, so I suspect they are correlated.

How to generate two uncorrelated (or correlated as little as possible) random variables?

an offer can't refuse
  • What are you really trying to achieve with this? – jonrsharpe Dec 31 '16 at 13:36
  • How exactly do you think they are correlated? Given an `x`, can you infer what value `y` will have? Why do you think pseudo-random numbers aren't good enough? – tobias_k Dec 31 '16 at 13:40
  • @tobias_k I've calculated the cov(X,Y) by code. That's the definition of correlation. – an offer can't refuse Dec 31 '16 at 13:42
  • It is an _estimate_ which has _an uncertainty_. In the long run, ie millions of experiments on really long vectors, the answer will converge to zero. That is the meaning of _asymptotic_, in a nutshell. – Dirk Eddelbuettel Dec 31 '16 at 14:18
  • @DirkEddelbuettel Every time I get approximately the same number; you can try it yourself and see whether it is *asymptotic* – an offer can't refuse Dec 31 '16 at 14:25
  • @buzhidao is the `xyT/2000-xT/2000*yT/2000` bit at the bottom supposed to be the correlation? With the formula as you have written it, I get values close to -22, which is obviously wrong, since correlation has the range [-1, 1]. If I try making a list of all the generated `x`s and `y`s, and then use `numpy.corrcoef(xs, ys)[0, 1]` to get the correlation coefficient, the correlation varies randomly but pretty consistently has absolute value less than 0.02, and frequently less than 0.01. If I increase the number of samples, the correlations are even lower. – zstewart Dec 31 '16 at 15:11
  • @zstewart I've made a typo, it is covariance. I've corrected the code. – an offer can't refuse Jan 01 '17 at 02:46

2 Answers

7

The math on RNGs is solid. These days, most popular implementations are too. As such, your conjecture that

> is generated by a formula, therefore they are correlated.

is incorrect.

But if you really truly deeply think that way, there is an out: hardware random number generators. The site at random.org has been providing hardware RNG draws "as a service" for a long time. Here is an example (in R, which I use more, but there is an official Python client):

R> library(random)
R> randomNumbers(min=1, max=20000)    # your range, default number
         V1    V2    V3    V4    V5
 [1,]   532 19452  5203 13646  5462
 [2,]  4611 10814  3694 12731   566
 [3,] 11884 19897  1601 10652   791
 [4,] 17427  9524  7522  1051  9432
 [5,]  5426  5079  2232  2517  4883
 [6,] 13807  9194 19980  1706  9205
 [7,] 13043 16250 12827  2161 10789
 [8,]  7060  6008  9110  8388  1102
 [9,] 12042 19342  2001 17780  3100
[10,] 11690  4986  4389 14187 17191
[11,] 19574 13615  3129 17176  5590
[12,] 11104  5361  8000  5260   343
[13,]  7518  7484  7359 16840 12213
[14,] 14914  1991 19952 10127 14981
[15,] 13528 18602 10182  1075 16480
[16,]  9631 17160 19808 11662 10514
[17,]  4827 13960 17003   864 11159
[18,]  8939  7095 16102 19836 15490
[19,]  8321  6007  1787  6113 17948
[20,]  9751  7060  8355 19065 15180
R> 
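For Python users, a rough equivalent can be sketched against RANDOM.ORG's plain-text HTTP interface (the URL and parameter names below follow their documented integer service, but this is my own sketch, not the official Python client; check the current API and quota rules before relying on it):

```python
from urllib.parse import urlencode
from urllib.request import urlopen

def randomorg_url(num=20, minimum=1, maximum=20000, cols=5):
    """Build a request URL for RANDOM.ORG's plain-text integer service."""
    params = urlencode({
        "num": num, "min": minimum, "max": maximum,
        "col": cols, "base": 10, "format": "plain", "rnd": "new",
    })
    return "https://www.random.org/integers/?" + params

def fetch_random_ints(**kwargs):
    """Fetch hardware-generated integers (network access required)."""
    with urlopen(randomorg_url(**kwargs)) as resp:
        return [int(tok) for tok in resp.read().split()]
```

For example, `fetch_random_ints(num=100)` would return one integer per whitespace-separated token of the response.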

Edit: The OP seems unconvinced, so here is a quick reproducible simulation (again in R, because that is what I use):

R> set.seed(42)               # set seed for RNG
R> mean(replicate(10, cor(runif(100), runif(100))))
[1] -0.0358398
R> mean(replicate(100, cor(runif(100), runif(100))))
[1] 0.0191165
R> mean(replicate(1000, cor(runif(100), runif(100))))
[1] -0.00117392
R> 

So you see that as we go from 10 to 100 to 1000 replications of just 100 U(0,1) draws each, the correlation estimate goes to zero.

We can make this a little nicer with a plot, recovering the same data plus some more:

R> set.seed(42)
R> x <- 10^(1:5)   # powers of ten from 1 to 5, driving 10^1 to 10^5 sims
R> y <- sapply(x, function(n) mean(replicate(n, cor(runif(100), runif(100)))))
R> y    # same first numbers as seed reset to same start
[1] -0.035839756  0.019116460 -0.001173916 -0.000588006 -0.000290494
R> plot(x, y, type='b', main="Illustration of convergence towards zero", log="x")
R> abline(h=0, col="grey", lty="dotted")

[Plot: "Illustration of convergence towards zero" — correlation estimates vs. number of replications on a log-scaled x axis, approaching the zero line]
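Since the question is in Python, here is a rough port of the simulation above (my own sketch; it uses a hand-rolled Pearson correlation so it needs nothing beyond the standard library):

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def mean_corr(replications, n=100, rng=random):
    """Average correlation over many independent pairs of uniform samples."""
    total = 0.0
    for _ in range(replications):
        xs = [rng.random() for _ in range(n)]
        ys = [rng.random() for _ in range(n)]
        total += pearson(xs, ys)
    return total / replications

random.seed(42)
for reps in (10, 100, 1000):
    print(reps, mean_corr(reps))   # the averages shrink towards zero
```

As with the R version, the individual correlations bounce around zero with standard error roughly 1/sqrt(n), and their average tightens as the number of replications grows.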

Dirk Eddelbuettel
  • did not know about random.org. looks interesting! – hiro protagonist Dec 31 '16 at 14:07
  • you mean in my code `x` and `y` are uncorrelated with each other? I really doubt that; if they were uncorrelated, the returned value should be zero – an offer can't refuse Dec 31 '16 at 14:14
  • Would you please elaborate why my conjecture is incorrect? – an offer can't refuse Dec 31 '16 at 14:16
  • I really hope your Computer Science or Mathematics instructors do not read this comment. Please read up on tests for random number generators; the site on the [dieharder suite by Robert Brown et al](https://www.phy.duke.edu/~rgb/General/dieharder.php) is one possible start, and there are _many_ others. Computers have been used for simulations _using uncorrelated random numbers_ since John von Neumann in the 1940s. There are literally libraries full of dissertations and papers you could read. – Dirk Eddelbuettel Dec 31 '16 at 14:16
  • @DirkEddelbuettel You are still not explaining why pseudo-random numbers generated by a particular routine will be uncorrelated; they simply are not. – an offer can't refuse Dec 31 '16 at 14:27
  • You will become rich and famous the day *you* prove that they are in fact correlated. (Hint: you won't.) In short, RNG tests can show absence (or, in the bad-generator case, presence) of failure. Have a read of e.g. [this piece by Marsaglia (from 1968 !!) demonstrating that an old generator was poor](http://www.ics.uci.edu/~fowlkes/class/cs177/marsaglia.pdf). Your conclusion of "it is on a computer and hence deterministic and hence measurably correlated" is sadly still incorrect. – Dirk Eddelbuettel Dec 31 '16 at 14:38
  • glad to know I'm incorrect. I will look into the resources you provided, thanks. – an offer can't refuse Dec 31 '16 at 14:44
  • I was told once (by an econometrics Prof.) that random number generators are only random for some finite number of draws and after these draws randomness is not a given anymore. (I am aware that this number is beyond anything feasible but this is rather a hypothetical question). Is that true let's say for Rs random draws? – Daniel Winkler Jun 21 '17 at 21:59
  • True in principle as computers are deterministic, and resolutions are finite. But it depends on your definition of 'some finite number' -- these are absurdly large. E.g. even for the good old Mersenne Twister (fancy and new when I was in school many moons ago) the period is 2^19937 - 1 -- see https://en.wikipedia.org/wiki/Mersenne_Twister – Dirk Eddelbuettel Jun 21 '17 at 22:02
4

Short answer: use a Bays-Durham shuffle on the output of your random number generator.

Longer answer:

I'm sure you know that the pseudo-random numbers given by computer algorithms are not truly random; they are just meant to pass most tests of randomness and thus be "good enough" for most uses. The same is true for uncorrelated random variables: you will never get truly uncorrelated ones, but your goal should be to get numbers that pass as many correlation tests as possible and are "good enough" for your purposes.

The main way that standard linear congruential generators fail correlation tests is when you look at a small region of the 2-space generated by the numbers. The pairs of numbers show obvious lines when graphed and thus are not truly uncorrelated. This only matters when you look at a very small region of all the generated number pairs. Is this what you need to do? Note that Python's random() function uses the Mersenne Twister rather than a linear congruential generator, so it is less likely to fail such a test. See Wikipedia's list of the disadvantages of the Mersenne Twister to judge whether Python's random number generator is good enough for your purposes; Python's implementation is discussed in detail later in that page.
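To see what such a failure looks like, consider the infamous RANDU generator (a linear congruential generator once shipped by IBM, and emphatically not what Python uses). Because its multiplier is 65539 = 2^16 + 3, squaring it modulo 2^31 yields the exact identity x_{n+2} = 6·x_{n+1} - 9·x_n (mod 2^31), which is why its consecutive triples fall on just 15 planes in 3-space. A quick check of that identity (my own illustration):

```python
def randu(seed, count):
    """RANDU: x_{n+1} = 65539 * x_n mod 2^31, a famously bad LCG."""
    x, out = seed, []
    for _ in range(count):
        x = (65539 * x) % 2**31
        out.append(x)
    return out

xs = randu(1, 1000)
# Every consecutive triple satisfies the planar identity exactly,
# so successive values are strongly (and provably) dependent:
assert all((xs[i+2] - 6*xs[i+1] + 9*xs[i]) % 2**31 == 0
           for i in range(len(xs) - 2))
print("all RANDU triples lie on the 15 planes")
```

No statistical test is even needed here; the dependence is an algebraic fact of the recurrence.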

I wrote routines in Borland Delphi (Object Pascal and x86 assembler) to avoid correlation. I have switched to Python but have not yet rewritten those routines. The idea of a Bays-Durham shuffle is to use the built-in random number generator to give you a random integer (the one that is used to make a floating-point number between 0 and 1). You then use that integer to point into an array of previously-generated random integers. You pick the integer pointed to and replace it in the array with your newly generated integer. Return the integer that used to be in the array, and if need be convert that to a number between 0 and 1.
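As a sketch of that idea in Python (my own illustration following the description above, not the Numerical Recipes code; the class name and 32-slot default are my choices):

```python
import random

class BaysDurhamShuffle:
    """Decorrelate a base generator by routing draws through a shuffle table."""

    def __init__(self, seed=None, table_size=32):
        self._rng = random.Random(seed)
        # Pre-fill the table with integers from the base generator.
        self._table = [self._rng.getrandbits(32) for _ in range(table_size)]

    def next_int(self):
        fresh = self._rng.getrandbits(32)   # new integer from the base RNG...
        slot = fresh % len(self._table)     # ...which also picks a table slot
        out, self._table[slot] = self._table[slot], fresh
        return out                          # return what the slot used to hold

    def random(self):
        return self.next_int() / 2**32      # map to a float in [0, 1)
```

Usage is just `g = BaysDurhamShuffle(seed=1)` and then `g.random()`; the shuffle table breaks up the sequential order of the base generator's output while preserving its distribution.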

I implemented this with an array of 32 integers and tested this new generator. This now passed the correlation test that Delphi's random number generator failed. I repeat, this would not pass all correlation tests, but it did pass more than the standard generator, and it was definitely good enough for my uses.

If you need to see a Python implementation of this, ask and I'll try to make the time to write one. Until then, look up "Bays-Durham shuffle". I learned of it from the book Numerical Recipes. Here is a Fortran version of that chapter. Here is the entire 2nd edition in C in Empanel format and here it is in PDF--find Chapter 7 Section 7.1. Source code for out-of-date editions is available in a variety of languages, including Fortran (I think), C, and Pascal. I downloaded the 2nd edition C version text and 1st edition Pascal code some years ago and used those for my own coding in Pascal.

Rory Daulton