0

Consider the following game: in each trial, you are presented with x red and y blue dots. You have to decide whether there are more red than blue dots. For each trial, the minimum number of dots in a given color is 10, the maximum is 50. Red and blue dots follow an identical multinomial distribution (for simplicity, let's consider that the probability of occurrence of each integer between 10 and 50 is similar).

I would like to build 300 trials. To do so, I draw 300 samples from each multinomial distribution. importantly, I would like to specify (a priori) the correlation between the 300 samples from the first distribution and the 300 samples from the second distribution. I would like a correlations of -0.8, -0.5, 0, 0.5 and 0.8, in five pairs of sample sets.

Preferably, I would like to also sample this sets so that in each set (X,Y) with any of the specified correlations, half of the X samples will be greater than Y (x(i) > y(i)), and the other half will be smaller than Y (x(i) < y(i)).

How can I do that in python, R or MATLAB?

EBH
  • 10,350
  • 3
  • 34
  • 59
user1363251
  • 421
  • 1
  • 11
  • 24

1 Answers1

1

Basically you ask how to create 2 vectors with a specified correlation, so it is more statistics than programing question, but it can be done in the following way:

step 1 - creating two vector with the desired correlation

r = 0.75;                % r is the desired correlation
M = rand(10000,2);       % two vectors from uniform distribution between 0 to 1
R = [1 r; r 1];
L = chol(R);             % this is Cholesky decomposition of R
M = M*L;                 % when multiplied by M it gives the wanted correlation
M = (M+abs(min(M(:))));  % shift the vector to only positive values
M = M./max(M(:));        % normalize the vector...
M = round(40*M)+10;      % ...to values between 10 to 50
disp([min(M(:)) max(M(:))])
first_r = corr( M(:,1), M(:,2));      % and check the resulted correlation

The rand function could be changed to any random generated numbers function, like randi or randn, and if some specific distribution is desired, it could be obtained using the it's cdf.

step 2 - Sampling these vectors for two sets of samples, one with x>y and one with y>x

x = M(:,1);
y = M(:,2);
Xy = x>y;                % logical index for all x > y
Yx = y>x;                % logical index for all y > x
xy1 = datasample([x(Xy) y(Xy)],150,'Replace',false); % make a 1/2 sample like Xy
xy2 = datasample([x(Yx) y(Yx)],150,'Replace',false); % make a 1/2 sample like Yx
x = [xy1(:,1);xy2(:,1)];           % concat the smaples back to x
y = [xy1(:,2);xy2(:,2)];           % concat the smaples back to y
checkx = sum(x>y)                  % how many times x is bigger than y
checky = sum(y>x)                  % how many times y is bigger than x
final_r = corr(x,y)                % and check the new correlation

step 3 - correcting the correlation

As you'll see the final_r is not like the desired r, so in order to get it you have to shift the first r by its distance from the final_r. Here's an example - first the output when r = 0.75:

    10    50
checkx =
   150
checky =
   150
final_r =
      0.67511

we see that the final_r is shifted down by 0.074886, so we want to shift the original r up by this value to get our final_r correct. So if we run it again with r = 0.75+0.074886, we get:

    10    50
checkx =
   150
checky =
   150
final_r =
      0.76379

which is fairly close to the desired r. I would run a loop over the process for , say, 1000 times to find the closest r to the desired one, or simply set a threshold that continue to search until the final_r is close enough to what you want.

EBH
  • 10,350
  • 3
  • 34
  • 59
  • @EBHThis is almost perfect, I really appreciate. May I ask for two refinements? First, could it be possible to adjust the code such that x > y on 50% of trials and x < y for the remaining trials? Second, seems that I need to specify the matrix R in a different way for negative correlations. Say I want a correlation of - 0.8. Setting R = [1 -0.8; -0.8 1] yields values that are no longer comprised between 10 to 50. Any idea? – user1363251 Jul 26 '16 at 21:56
  • thanks again for the precious help. Doesn't seem to work. At the end of the code I added: Xy = x>y; checkx = length (find(Xy == 1)); Yx = y>x; checky = length(find(Yx ==1)); but checkx and checky are very different, indicating that x is not superior to y on 50% of trials. Any idea before I edit my initial post? – user1363251 Jul 27 '16 at 13:29
  • Thanks! I've thought about this trick. The use of datasample() after the cholesky decomposition lowers the correlation. But the problem can be solved by looping over subsamples to find the closest r. I'll now edit my initial post, I think this interaction will be useful to others. – user1363251 Jul 27 '16 at 15:53
  • @user1363251 I'm glad it helped you, and please consider [accepting and/or voting this answer](http://stackoverflow.com/help/someone-answers) so futre readers will know it was helpful. – EBH Jul 27 '16 at 17:36
  • I've now thoroughly tested your code, and I see two issues. I'll open a new thread to solve these problems and generalize the method you provided. Thanks again for the precious help. – user1363251 Jul 29 '16 at 18:24