
I have a variable with a given distribution (normal in the example below).

set.seed(32)    
var1 = rnorm(100,mean=0,sd=1)

I want to create a variable (var2) that is correlated with var1, with a linear correlation coefficient (roughly or exactly) equal to "Corr". The slope of the regression between var1 and var2 should (roughly or exactly) equal 1.

Corr = 0.3

How can I achieve this?

I wanted to do something like this:

decorelation = rnorm(100,mean=0,sd=1-Corr)
var2 = var1 + decorelation

But of course when running:

cor(var1,var2)

The result is not close to Corr!

double-beep
Remi.b
  • Here's a related answer on a different site: http://quant.stackexchange.com/questions/1027/how-are-correlation-and-cointegration-related/1038#1038 – bill_080 Jun 11 '13 at 14:57
  • This is a duplicate of http://stats.stackexchange.com/questions/15011/generate-a-random-variable-with-a-defined-correlation-to-an-existing-variable – eddi Jun 11 '13 at 15:00
  • Related question: http://stackoverflow.com/questions/16122520/how-to-generate-sample-data-with-exact-moments – Vincent Zoonekynd Jun 11 '13 at 15:17

1 Answer


I did something similar a while ago. I am pasting some code that is for 3 correlated variables but it can be easily generalized to something more complex.

Create an F matrix first:

cor_Matrix <-  matrix(c (1.00, 0.90, 0.20 ,
                     0.90, 1.00, 0.40 ,
                     0.20, 0.40, 1.00), 
                  nrow=3,ncol=3,byrow=TRUE)

This can be an arbitrary correlation matrix.
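One caveat on "arbitrary": the matrix still has to be a valid correlation matrix, that is, symmetric, with ones on the diagonal and no negative eigenvalues. A minimal sketch of such a check in base R follows. The helper name `is_valid_cor`, its tolerance and the `bad` example are my own choices for illustration, not part of the code in this answer.

```r
# Hypothetical helper: TRUE if m could be a correlation matrix.
is_valid_cor <- function(m, tol = 1e-8) {
  isSymmetric(m) &&
    all(abs(diag(m) - 1) < tol) &&
    all(eigen(m, symmetric = TRUE, only.values = TRUE)$values > -tol)
}

ok <- matrix(c(1.0, 0.9, 0.2,
               0.9, 1.0, 0.4,
               0.2, 0.4, 1.0), nrow = 3, byrow = TRUE)

# Same matrix with r(2,3) changed to -0.9: no three variables can have
# these pairwise correlations, so one eigenvalue becomes negative.
bad <- ok
bad[2, 3] <- bad[3, 2] <- -0.9

is_valid_cor(ok)   # valid
is_valid_cor(bad)  # not valid
```

So the off-diagonal entries cannot all be chosen independently of one another: fixing two of the three correlations restricts the range the third one can take.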

library(psych) 

fit<-principal(cor_Matrix, nfactors=3, rotate="none")

fit$loadings

loadings<-matrix(fit$loadings[1:3, 1:3],nrow=3,ncol=3,byrow=FALSE)
loadings

# create three independent standard normal variables

cases <- t(replicate(3, rnorm(3000)) ) #edited, changed to 3000 cases from 150 cases

multivar <- loadings %*% cases
T_multivar <- t(multivar)

var<-as.data.frame(T_multivar)

cor(var)
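If you would rather not depend on psych, the same idea can be sketched in base R with a Cholesky factor in place of the principal-component loadings. This is a swapped-in technique, not what the code above does, and `n`, `z`, `U` and `x` are names chosen for this sketch.

```r
set.seed(1)
cor_Matrix <- matrix(c(1.00, 0.90, 0.20,
                       0.90, 1.00, 0.40,
                       0.20, 0.40, 1.00),
                     nrow = 3, ncol = 3, byrow = TRUE)

n <- 3000
z <- matrix(rnorm(n * 3), nrow = n, ncol = 3)  # independent N(0, 1) columns

# chol() returns an upper-triangular U with t(U) %*% U equal to cor_Matrix,
# so the rows of z %*% U have population correlation cor_Matrix.
U <- chol(cor_Matrix)
x <- z %*% U

round(cor(x), 2)  # close to cor_Matrix, up to sampling error
```

As with the loadings version, the sample correlations only approach the target values as the number of cases grows.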

Again, this can be generalized. Your approach listed above does not create a multivariate data set.
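For the two-variable case in the question, here is a minimal sketch of one way to get a chosen correlation and a regression slope of 1 at the same time. It assumes "slope" means the slope of var2 regressed on var1, and that var2 = var1 + noise with the noise uncorrelated with var1. Under that assumption cor(var1, var2) = sd(var1) / sqrt(sd(var1)^2 + sd(noise)^2), so the noise SD has to be sd(var1) * sqrt(1/Corr^2 - 1) rather than 1 - Corr (with an SD of 0.7 the expected correlation is about 0.82). The names `noise_sd`, `noise` and `var2_approx` are mine.

```r
set.seed(32)
var1 <- rnorm(100, mean = 0, sd = 1)
Corr <- 0.3

# Noise SD that gives correlation Corr when var2 = var1 + noise
noise_sd <- sd(var1) * sqrt(1 / Corr^2 - 1)

# Approximate version: independent noise, so the result varies with sampling error
var2_approx <- var1 + rnorm(100, mean = 0, sd = noise_sd)

# Exact version: make the noise exactly uncorrelated with var1 in this sample
# (residuals of a regression on var1), then rescale it to the required SD
noise <- resid(lm(rnorm(100) ~ var1))
noise <- noise / sd(noise) * noise_sd
var2 <- var1 + noise

cor(var1, var2)           # equals Corr up to floating-point error
coef(lm(var2 ~ var1))[2]  # slope of var2 on var1, equals 1
```

Note that the slope of var1 regressed on var2 is then Corr^2, not 1; both slopes can only equal 1 when the correlation itself is 1.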

SprengMeister
  • Thanks a lot! When there are many variables, is there a way to leave some of the correlations in cor_Matrix free, so that they do not bias the other correlations? For example, say I want cor(var1,var2) == 0.8 and cor(var1,var3) == 0.3. If I set cor(var2,var3) == 1 (or any other number), it will bias the two other correlations. So is there a way to leave one correlation unspecified? – Remi.b Jun 11 '13 at 18:24
  • Hi, I am not entirely sure what you mean, but picking two correlations (e.g. r(xy) and r(yz)) should not affect in any way how x and z are correlated. So there is no bias. You can leave them uncorrelated. Are you thinking about error? I guess you could add some error on top of these correlations: you could add to each variable a random variable with mean = 0 and an arbitrarily chosen SD. – SprengMeister Jun 11 '13 at 18:33
  • Well, if you use these two different cor_Matrix values (after setting the seed): c(1,0.9,0.2, 0.9,1,0.4, 0.2,0.4,1) and c(1,0.9,0.2, 0.9,1,0.7, 0.2,0.7,0.1), you'll get different results for the correlation between the first two variables, although their target correlation does not change. Isn't that so? – Remi.b Jun 11 '13 at 18:46
  • I realize that I might be a bit far from understanding your code. Why does cor_Matrix=matrix(c(rep(1,9)) yield very poor correlations? – Remi.b Jun 11 '13 at 18:50
  • Hi Remi, I think I may understand now. What you are seeing may be sampling error. In my code above I was only creating 150 cases, which leaves room for substantial sampling error. I have changed the number of cases in the code above to 3000. Now you can change the intercorrelations independently and they will be close to what you put into the F matrix. – SprengMeister Jun 11 '13 at 19:16
  • Hi. It is not a matter of sampling error. But I might not understand the cor_Matrix. Shouldn't I enter the values of the correlations that I want to get? When we create a cor_Matrix of nine 1's only, the result of cor(var) is not at all a matrix of 1's (regardless of the sample size). Thanks, SprengMeister, for taking the time to help me! – Remi.b Jun 11 '13 at 19:30
  • With all variables correlated at exactly 1.0, the principal component analysis throws an error. Have you changed the number of cases generated to something much higher than 150? I checked the code and am getting results as expected. EDIT: note that the values may not be exactly the values you put into the F matrix, but they should approach them. – SprengMeister Jun 11 '13 at 19:54
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/31604/discussion-between-remi-b-and-sprengmeister) – Remi.b Jun 11 '13 at 20:06