Basically you ask how to create 2 vectors with a specified correlation, so it is more statistics than programing question, but it can be done in the following way:
step 1 - creating two vector with the desired correlation
r = 0.75; % r is the desired correlation
M = rand(10000,2); % two vectors from uniform distribution between 0 to 1
R = [1 r; r 1];
L = chol(R); % this is Cholesky decomposition of R
M = M*L; % when multiplied by M it gives the wanted correlation
M = (M+abs(min(M(:)))); % shift the vector to only positive values
M = M./max(M(:)); % normalize the vector...
M = round(40*M)+10; % ...to values between 10 to 50
disp([min(M(:)) max(M(:))])
first_r = corr( M(:,1), M(:,2)); % and check the resulted correlation
The rand
function could be changed to any random generated numbers function, like randi
or randn
, and if some specific distribution is desired, it could be obtained using the it's cdf.
step 2 - Sampling these vectors for two sets of samples, one with x>y and one with y>x
x = M(:,1);
y = M(:,2);
Xy = x>y; % logical index for all x > y
Yx = y>x; % logical index for all y > x
xy1 = datasample([x(Xy) y(Xy)],150,'Replace',false); % make a 1/2 sample like Xy
xy2 = datasample([x(Yx) y(Yx)],150,'Replace',false); % make a 1/2 sample like Yx
x = [xy1(:,1);xy2(:,1)]; % concat the smaples back to x
y = [xy1(:,2);xy2(:,2)]; % concat the smaples back to y
checkx = sum(x>y) % how many times x is bigger than y
checky = sum(y>x) % how many times y is bigger than x
final_r = corr(x,y) % and check the new correlation
step 3 - correcting the correlation
As you'll see the final_r
is not like the desired r
, so in order to get it you have to shift the first r
by its distance from the final_r
. Here's an example - first the output when r = 0.75
:
10 50
checkx =
150
checky =
150
final_r =
0.67511
we see that the final_r
is shifted down by 0.074886, so we want to shift the original r
up by this value to get our final_r
correct. So if we run it again with r = 0.75+0.074886
, we get:
10 50
checkx =
150
checky =
150
final_r =
0.76379
which is fairly close to the desired r
. I would run a loop over the process for , say, 1000 times to find the closest r
to the desired one, or simply set a threshold that continue to search until the final_r
is close enough to what you want.