How to specify a priori correlation between samples drawn randomly from two multinomial distributions?

Question

Consider the following game: in each trial, you are presented with x red and y blue dots. You have to decide whether there are more red than blue dots. For each trial, the minimum number of dots in a given color is 10, the maximum is 50. Red and blue dots follow an identical multinomial distribution (for simplicity, let's consider that the probability of occurrence of each integer between 10 and 50 is similar).

I would like to build 300 trials. To do so, I draw 300 samples from each multinomial distribution. importantly, I would like to specify (a priori) the correlation between the 300 samples from the first distribution and the 300 samples from the second distribution. I would like a correlations of -0.8, -0.5, 0, 0.5 and 0.8, in five pairs of sample sets.

Preferably, I would like to also sample this sets so that in each set (X,Y) with any of the specified correlations, half of the X samples will be greater than Y (x(i) > y(i)), and the other half will be smaller than Y (x(i) < y(i)).

How can I do that in python, R or MATLAB?

You start with _red_ and _blue_ dots, and suddenly they become _green_? — EBH, Jul 26 '16 at 19:50
Use copulas? Does [this](http://stackoverflow.com/a/37515473/5540279) answer your question? — nirvana-msu, Jul 26 '16 at 20:24

EBH · Answer 1 · 2016-07-29T08:43:34.567

Basically you ask how to create 2 vectors with a specified correlation, so it is more statistics than programing question, but it can be done in the following way:

step 1 - creating two vector with the desired correlation

r = 0.75;                % r is the desired correlation
M = rand(10000,2);       % two vectors from uniform distribution between 0 to 1
R = [1 r; r 1];
L = chol(R);             % this is Cholesky decomposition of R
M = M*L;                 % when multiplied by M it gives the wanted correlation
M = (M+abs(min(M(:))));  % shift the vector to only positive values
M = M./max(M(:));        % normalize the vector...
M = round(40*M)+10;      % ...to values between 10 to 50
disp([min(M(:)) max(M(:))])
first_r = corr( M(:,1), M(:,2));      % and check the resulted correlation

The rand function could be changed to any random generated numbers function, like randi or randn, and if some specific distribution is desired, it could be obtained using the it's cdf.

step 2 - Sampling these vectors for two sets of samples, one with x>y and one with y>x

x = M(:,1);
y = M(:,2);
Xy = x>y;                % logical index for all x > y
Yx = y>x;                % logical index for all y > x
xy1 = datasample([x(Xy) y(Xy)],150,'Replace',false); % make a 1/2 sample like Xy
xy2 = datasample([x(Yx) y(Yx)],150,'Replace',false); % make a 1/2 sample like Yx
x = [xy1(:,1);xy2(:,1)];           % concat the smaples back to x
y = [xy1(:,2);xy2(:,2)];           % concat the smaples back to y
checkx = sum(x>y)                  % how many times x is bigger than y
checky = sum(y>x)                  % how many times y is bigger than x
final_r = corr(x,y)                % and check the new correlation

step 3 - correcting the correlation

As you'll see the final_r is not like the desired r, so in order to get it you have to shift the first r by its distance from the final_r. Here's an example - first the output when r = 0.75:

    10    50
checkx =
   150
checky =
   150
final_r =
      0.67511

we see that the final_r is shifted down by 0.074886, so we want to shift the original r up by this value to get our final_r correct. So if we run it again with r = 0.75+0.074886, we get:

    10    50
checkx =
   150
checky =
   150
final_r =
      0.76379

which is fairly close to the desired r. I would run a loop over the process for , say, 1000 times to find the closest r to the desired one, or simply set a threshold that continue to search until the final_r is close enough to what you want.

@EBHThis is almost perfect, I really appreciate. May I ask for two refinements? First, could it be possible to adjust the code such that x > y on 50% of trials and x < y for the remaining trials? Second, seems that I need to specify the matrix R in a different way for negative correlations. Say I want a correlation of - 0.8. Setting R = [1 -0.8; -0.8 1] yields values that are no longer comprised between 10 to 50. Any idea? — user1363251, Jul 26 '16 at 21:56
thanks again for the precious help. Doesn't seem to work. At the end of the code I added: Xy = x>y; checkx = length (find(Xy == 1)); Yx = y>x; checky = length(find(Yx ==1)); but checkx and checky are very different, indicating that x is not superior to y on 50% of trials. Any idea before I edit my initial post? — user1363251, Jul 27 '16 at 13:29
Thanks! I've thought about this trick. The use of datasample() after the cholesky decomposition lowers the correlation. But the problem can be solved by looping over subsamples to find the closest r. I'll now edit my initial post, I think this interaction will be useful to others. — user1363251, Jul 27 '16 at 15:53
@user1363251 I'm glad it helped you, and please consider [accepting and/or voting this answer](http://stackoverflow.com/help/someone-answers) so futre readers will know it was helpful. — EBH, Jul 27 '16 at 17:36
I've now thoroughly tested your code, and I see two issues. I'll open a new thread to solve these problems and generalize the method you provided. Thanks again for the precious help. — user1363251, Jul 29 '16 at 18:24

How to specify a priori correlation between samples drawn randomly from two multinomial distributions?

1 Answers1