3

For a simulation study, I want to generate a set of random variables (both continuous and binary) that have predefined associations to an already existing binary variable, denoted here as x.

For this post, assume that x is generated following the code below. But remember: in real life, x is an already existing variable.

set.seed(1245)
x <- rbinom(1000, 1, 0.6)

I want to generate both a binary variable and a continuous variable. I have figured out how to generate a continuous variable (see code below)

set.seed(1245)

cor <- 0.8 #Correlation 
y <- rnorm(1000, cor*x, sqrt(1-cor^2))

But I can't find a way to generate a binary variable that is correlated to the already existing variable x. I found several R packages, such as copula which can generate random variables with a given dependency structure. However, they do not provide a possibility to generate variables with a set dependency on an already existing variable.

Does anyone know how to do this in an efficient way?

Thanks!

ecl
  • 369
  • 1
  • 15
  • You for clarification: you want `cov(x,y)` to be `0.8`, correct? Does it have to be exactly `0.8` or is about `0.8` acceptable? – TimTeaFan Jan 11 '21 at 10:47
  • It does not need to be exactly 0.8. About 0.8 is fully acceptable. FYI, my current way of generating the continuous variable does not lead to a perfect 0.8 correlation. – ecl Jan 11 '21 at 10:49

2 Answers2

3

If we look at the formula for correlation:

enter image description here

For the new vector y, if we preserve the mean, the problem is easier to solve. That means we copy the vector x and try to flip a equal number of 1s and 0s to achieve the intended correlation value.

If we let E(X) = E(Y) = x_bar , and E(XY) = xy_bar, then for a given rho, we simplify the above to:

(xy_bar - x_bar^2) / (x_bar - x_bar^2) =  rho

Solve and we get:

xy_bar = rho * x_bar + (1-rho)*x_bar^2

And we can derive a function to flip a number of 1s and 0s to get the result:

create_vector = function(x,rho){

  n = length(x)
  x_bar = mean(x)
  xy_bar = rho * x_bar + (1-rho)*x_bar^2
  toflip = sum(x == 1) - round(n * xy_bar)

  y = x
  y[sample(which(x==0),toflip)] = 1
  y[sample(which(x==1),toflip)] = 0
  return(y)
}

For your example it works:

set.seed(1245)
x <- rbinom(1000, 1, 0.6)
cor(x,create_vector(x,0.8))
[1] 0.7986037

There are some extreme combinations of intended rho and p where you might run into problems, for example:

set.seed(111)

res = lapply(1:1000,function(i){
             
              this_rho = runif(1)
              this_p = runif(1)
              x = rbinom(1000,1,this_p)
              data.frame(
                intended_rho = this_rho,
                p = this_p,
                resulting_cor = cor(x,create_vector(x,this_rho))
              )
           })

res = do.call(rbind,res)

ggplot(res,aes(x=intended_rho,y=resulting_cor,col=p)) + geom_point()

enter image description here

StupidWolf
  • 45,075
  • 17
  • 40
  • 72
2

Here's a binomial one - the formula for q only depends on the mean of x and the correlation you desire.

set.seed(1245)
cor <- 0.8
x <- rbinom(100000, 1, 0.6)
p <- mean(x)
q <- 1/((1-p)/cor^2+p)
y <- rbinom(100000, 1, q)
z <- x*y
cor(x,z)
#> [1] 0.7984781

This is not the only way to do this - note that mean(z) is always less than mean(x) in this construction.

The continuous variable is even less well defined - do you really not care about its mean/variance, or anything else about its distibution?

Here's another simple version where it flips the variable both ways:

set.seed(1245)
cor <- 0.8
x <- rbinom(100000, 1, 0.6)
p <- mean(x)
q <- (1+cor/sqrt(1-(2*p-1)^2*(1-cor^2)))/2
y <- rbinom(100000, 1, q)
z <- x*y+(1-x)*(1-y)
cor(x,z)
#> [1] 0.8001219
mean(z)
#> [1] 0.57908
pseudospin
  • 2,737
  • 1
  • 4
  • 19
  • That seems to be what I was looking for. Thank you! Saved my day! I will award you with the bounty but need to wait another 5 hours. – ecl Jan 11 '21 at 13:08
  • And to your questions, no I do not care about its mean/variance or any other aspects of its distribution. Please, feel free to share if you have any ideas on how to improve the simulation of the continuous variable. – ecl Jan 11 '21 at 13:41
  • A short comment: The only issue I can see is the deterministic factor of ```x``` since only people with ```x == 1``` can get ```z == 1```. The only varying factor is the number who has ```z == 1```. But need to run some analyses to see if that's an issue for me. Once again, thanks for the help! – ecl Jan 11 '21 at 14:19