1

I am trying to create a sample data set (most of the code is from this question). It is almost how I want it to be. However, there are two things I still want to do, but I cannot figure out.

  1. I would like to create a higher correlation between y and year, without rearranging the whole data set (so by only changing the values of y).

  2. If possible (I currently just manually changed the set.seed() until I got a significant relation), I would like to be able to determine the true correlation between the event and y. (again only y can be changed).

Could someone help me with explaining how to do this?

set.seed(2)

a    <- 2    # structural parameter of interest
b    <- 1    # strength of instrument
rho  <- 0.5  # degree of endogeneity

N    <- 1000
z    <- rnorm(N)
res1 <- rnorm(N)
res2 <- res1*rho + sqrt(1-rho*rho)*rnorm(N)
x    <- z*b + res1
ys   <- x*a + res2
d    <- (ys>0) #dummy variable
y    <- round(10-(d*ys))
random_variable <- rnorm(100, mean = 0, sd = 1)

library(data.table)
DT_1 <- data.frame(y,x,z, random_variable)
DT_2 <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 
45, 46, 47, 48, 49, 50), year = c(1995, 1995, 1995, 1995, 1995, 
1995, 1995, 1995, 1995, 1995, 2000, 2000, 2000, 2000, 2000, 2000, 
2000, 2000, 2000, 2000, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 
2005, 2005, 2005, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 
2010, 2010, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 
2015), Group = c("A", "A", "A", "A", "B", "B", "B", "B", "C", 
"C", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "A", "A", 
"A", "A", "B", "B", "B", "B", "C", "C", "A", "A", "A", "A", "B", 
"B", "B", "B", "C", "C", "A", "A", "A", "A", "B", "B", "B", "B", 
"C", "C"), event = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), win_or_lose = c(-1, 
-1, -1, -1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, 1, 1, 1, 1, 0, 0, 
-1, -1, -1, -1, 1, 1, 1, 1, 0, 0)), row.names = c(NA, -50L), class = c("tbl_df", 
"tbl", "data.frame"))
DT_1 <- setDT(DT_1)
DT_2 <- setDT(DT_2)
DT_2 <- rbind(DT_2 , DT_2 [rep(1:50, 19), ])
sandbox <- cbind(DT_1, DT_2)
Tom
  • 2,173
  • 1
  • 17
  • 44
  • Do you want a negative or a postive corr? – MacOS Jan 07 '21 at 11:08
  • 1
    Thank you for you comment. The best option would be that I can change it around for each year, is that a possibility? If not, positive is fine:) – Tom Jan 07 '21 at 11:11
  • 1
    Thank you for clarifying that. I have another question. Is there anything that I'm not allowed to change? – MacOS Jan 07 '21 at 14:00
  • I would prefer that the changes do not take place in `DT_2` . I am trying to recreate actual data (to see how the results will respond to different regressions, to evaluate the accuracy of those regression). – Tom Jan 07 '21 at 14:05
  • I posted an answer, please have a look. I think that it may not be the thing you are looking for though. – MacOS Jan 07 '21 at 14:29

2 Answers2

3

This approach uses the following idea:

  • Add a deterministic term to y that depends on year. This boosts the correlation a lot. Here, it depends on beta. You can tweak beta to increase the influence of year. Note that I worked with year-mean(year) so that the overall scale of y is not shifted too much. If you don't care about y being shifted, just drop the mean-part.
  • Add some gaussian noise to y. You can tweak the sd parameter to increase the noise, thus decrease the correlation.

I save the result in y2 so that you can play around more easily. When you're satisfied with parameters beta and sd, you can just overwrite y.

noise = rnorm(n = nrow(sandbox), mean = 0, sd = 0.01)
beta = 0.1
sandbox$y2 = sandbox$y + beta * (sandbox$year - mean(sandbox$year)) + noise
cor(sandbox$y2, sandbox$year)

Good luck and please provide feedback or clarification if this is not the desired behavior.

EDIT: Here you can see the behavior of different beta and sigma values:

betas = seq(-.50, .50, by=.10)
sigmas = seq(0.0, 5.0, by=1.0)

M = matrix(data=NA, nrow=length(betas), ncol=length(sigmas))
for (b in 1:length(betas)){
  for (s in 1:length(sigmas)){
    noise = rnorm(n = nrow(sandbox), mean = 0, sd = sigmas[s])
    sandbox$y2 = sandbox$y + betas[b] * (sandbox$year - mean(sandbox$year)) + noise
    M[b,s] = round(cor(sandbox$y2, sandbox$year), 2)
  }
}
rownames(M) = betas
colnames(M) = sigmas
M

resulting in the following matrix output. Rows are beta, columns are sigma, cell value is the correlation of y and year:

         0     1     2     3     4     5
-0.5 -0.86 -0.84 -0.77 -0.66 -0.62 -0.55
-0.4 -0.81 -0.78 -0.70 -0.61 -0.53 -0.47
-0.3 -0.71 -0.68 -0.61 -0.51 -0.45 -0.42
-0.2 -0.56 -0.51 -0.46 -0.32 -0.29 -0.25
-0.1 -0.32 -0.29 -0.25 -0.22 -0.12 -0.08
0     0.01 -0.01  0.00 -0.01  0.01 -0.01
0.1   0.33  0.31  0.24  0.21  0.19  0.12
0.2   0.57  0.52  0.45  0.38  0.33  0.27
0.3   0.72  0.66  0.59  0.48  0.44  0.33
0.4   0.81  0.78  0.71  0.62  0.54  0.48
0.5   0.86  0.84  0.78  0.69  0.61  0.53

EDIT 2: Of course, you can simply have a negative beta to achieve negative correlations. You might also just fix sigma and only adjust beta.

marvinschmitt
  • 346
  • 1
  • 7
  • Thanks very much Marvin, This looks very nice! I will check it out in detail as soon as possible! – Tom Jan 12 '21 at 15:48
  • Hi Tom, happy to hear that. Please let me know if you expect another behavior. The principle of adding a deterministic term and random noise is pretty powerful in modeling correlative relations. – marvinschmitt Jan 12 '21 at 15:51
  • If you found the time to check the answer and it resolved your question, please consider accepting it with the gray check to its left. – marvinschmitt Jan 13 '21 at 16:01
  • 1
    Thank you for reminding me (I know how to accept the answer haha ;)) – Tom Jan 13 '21 at 17:27
1

What about the following solution.

generate.values <- function(year) {
  
  if (year == 1995)
    return(rnorm(1, mean = 50))
  
  if (year == 2000)
    return(rnorm(1, mean = 100))
  
  if (year == 2005)
    return(rnorm(1, mean = 150))
  
  if (year == 2010)
    return(rnorm(1, mean = 200))
  
  if (year == 2015)
    return(rnorm(1, mean = 250))
}

sandbox$y <- apply(sandbox,
                   MARGIN = 1,
                   FUN = function(row) {
                     return(generate.values(row["year"]))
                   })

cor(sandbox$y, sandbox$year)

This gives me a correlation around 0.99. Note how I increase the mean of the normal distribution for each year. If you want to have a negative correlation, then you can simply change the sign of the mean parameter. So the function becomes

generate.values <- function(year) {
  
  if (year == 1995)
    return(rnorm(1, mean = -50))
  
  if (year == 2000)
    return(rnorm(1, mean = -100))
  
  if (year == 2005)
    return(rnorm(1, mean = -150))
  
  if (year == 2010)
    return(rnorm(1, mean = -200))
  
  if (year == 2015)
    return(rnorm(1, mean = -250))
}

Of course you can incorporate that into a single function. We then have

generate.values <- function(year, negative.correlation = FALSE) {
  
  means <- c(50, 100, 150, 200, 250)
  names(means) <- c("1995", "2000", "2005", "2010", "2015")
  
  if (negative.correlation)
    means <- -means
  
  
  if (year == 1995)
    return(rnorm(1, mean = means["1995"]))
  
  if (year == 2000)
    return(rnorm(1, mean = means["2000"]))
  
  if (year == 2005)
    return(rnorm(1, mean = means["2005"]))
  
  if (year == 2010)
    return(rnorm(1, mean = means["2010"]))
  
  if (year == 2015)
    return(rnorm(1, mean = means["2015"]))
}

HTH!

MacOS
  • 1,149
  • 1
  • 7
  • 14
  • Thanks a lot! It is not exactly what I wanted because it replaces `y` completely. But I was thinking that I could perhaps interact your solution with the `y` that I have. Something like `sandbox$y <- sandbox$y*apply(sandbox,...`. Will try that out soon:) – Tom Jan 08 '21 at 14:15
  • Tom ... in your question you said *"by only changing the values of y"*, and here you complain when the suggested code does precisely that. Do you mean something like *"by only **nudging** values of y"* instead? (Where "nudge" is rather subjective.) – r2evans Jan 08 '21 at 21:35
  • @r2evans Sorry, I completely missed your comment. But yes you are right, I only want to "nudge" the values. I have to agree that my explanation in this question was lacking quite a bit. I could not really pin point what I wanted. Essentially I wanted a database in which I could manually control every relation (which I think is a bit paradoxal). – Tom Jan 12 '21 at 15:46