
How can I simulate data so that the coefficients recovered by lm are exactly equal to pre-determined values, while the residuals are normally distributed? For example, could I generate data so that lm(y ~ 1 + x) yields (Intercept) = 1.500 and x = 4.000? The solution should be versatile enough to work for multiple regression with continuous predictors (e.g., lm(y ~ 1 + x1 + x2)), with bonus points if it also handles interactions (lm(y ~ 1 + x1 + x2 + x1*x2)). It should also work for small N (e.g., N < 200).

I know how to simulate random data generated from these parameters (see e.g. here), but that randomness carries over into the estimated coefficients, e.g., (Intercept) = 1.488 and x = 4.067.
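For reference, a minimal sketch of that naive simulation (illustrative; every run gives slightly different estimates):

x <- rnorm(100)
y <- 1.5 + 4 * x + rnorm(100)  # true intercept 1.5, true slope 4
coef(lm(y ~ x))                # close to, but not exactly, c(1.5, 4)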

Related: it is possible to generate data that yields pre-determined correlation coefficients (see here and here). So I'm asking whether the same can be done for multiple regression.

— Jonas Lindeløv
  • Isn't this trivial? `x <- 1:100; y <- cbind(1,x) %*% c(1.5, 4); lm(y ~ x)`. Or am I missing something? – Roland Aug 05 '19 at 14:33
  • @Roland, yes, sorry for being unclear. I just updated the question to specify that it should have normally distributed residuals. – Jonas Lindeløv Aug 05 '19 at 14:37

2 Answers


One approach is to use perfectly symmetric noise: duplicate each observation, adding the noise to one copy and subtracting it from the other. Because every noise value appears once with each sign against the same predictor values, the noise is exactly orthogonal to every column of the design matrix, so the estimated parameters are exactly the input parameters, yet the residuals appear normally distributed.

x <- 1:100
y <- cbind(1, x) %*% c(1.5, 4)  # exact model values: intercept 1.5, slope 4
eps <- rnorm(100)               # normally distributed noise

# Duplicate the data, adding the noise to one copy and subtracting it
# from the other, so it cancels exactly in the least-squares fit
x <- c(x, x)
y <- c(y + eps, y - eps)

fit <- lm(y ~ x)
coef(fit)
# (Intercept)            x  
#         1.5          4.0 

plot(fit)

The diagnostic plots from plot(fit) show that the residuals are normally distributed... but exhibit an abnormally perfect symmetry!
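A quick numerical check of that symmetry (not part of the original answer; fit comes from the code above):

res <- residuals(fit)
all.equal(sort(res), sort(-res))  # TRUE: every residual has an exact mirror image
shapiro.test(res)                 # typically does not reject normality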

EDIT by OP: I wrote up general-purpose code exploiting the symmetrical-residuals trick. It scales to more complex models; the example below shows that it also works for categorical predictors and interaction effects.

library(dplyr)

# Data and residuals
df = tibble(
  # Predictors
  x1 = 1:100,  # Continuous
  x2 = rep(c(0, 1), each=50),  # Dummy-coded categorical

  # Generate y from model, including interaction term
  y_model = 1.5 + 4 * x1 - 2.1 * x2 + 8.76543 * x1 * x2,
  noise = rnorm(100)  # Residuals
)

# Do the symmetrical-residuals trick
# This is copy-and-paste ready, no matter model complexity.
df = bind_rows(
  df %>% mutate(y = y_model + noise),
  df %>% mutate(y = y_model - noise)  # Mirrored
)

# Check that it works
fit <- lm(y ~ x1 + x2 + x1*x2, df)
coef(fit)
# (Intercept)          x1          x2       x1:x2 
#     1.50000     4.00000    -2.10000     8.76543 
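Note that the mirroring doubles the row count, so for a final sample of size N you should start from N/2 unique predictor rows. A minimal sketch (assuming N is divisible by 4 so the dummy coding splits evenly):

N <- 100  # desired final sample size
df = tibble(
  x1 = rnorm(N / 2),
  x2 = rep(c(0, 1), each = N / 4),
  y_model = 1.5 + 4 * x1 - 2.1 * x2 + 8.76543 * x1 * x2,
  noise = rnorm(N / 2)
)
# ... then apply the same bind_rows() mirroring and lm() check as above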

— asachet

You could use rejection sampling: repeatedly draw fresh noise and refit until the estimated coefficients fall within a tolerance of the targets.

set.seed(42)
tol <- 1e-8  # accepted squared distance from the target coefficients

x <- 1:100
continue <- TRUE
while (continue) {
  # Draw new noise and refit until the estimates hit the targets
  y <- cbind(1, x) %*% c(1.5, 4) + rnorm(length(x))
  if (sum((coef(lm(y ~ x)) - c(1.5, 4))^2) < tol) continue <- FALSE
}

coef(lm(y ~ x))
#(Intercept)           x 
#   1.500013    4.000023

Obviously, this is a brute-force approach: the smaller the tolerance and the more complex the model, the longer it will take. A more efficient approach should be possible by providing the residuals as input and then employing some matrix algebra to calculate the y values. But that's more of a maths question ...
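One way to make that matrix-algebra idea concrete (a sketch, not part of the original answer): project the random noise onto the orthogonal complement of the column space of the design matrix. The projected noise is exactly uncorrelated with every predictor, so lm recovers the target coefficients exactly, and it is still a linear transform of Gaussian noise, hence normally distributed (though its entries are no longer independent).

set.seed(42)
x <- 1:100
X <- cbind(1, x)    # design matrix
beta <- c(1.5, 4)   # target coefficients
eps <- rnorm(nrow(X))
# Residualize the noise against X: eps_perp = (I - H) eps, with H the hat matrix
eps_perp <- eps - X %*% solve(crossprod(X), crossprod(X, eps))
y <- X %*% beta + eps_perp
coef(lm(y ~ x))  # exactly 1.5 and 4.0, up to floating-point error

If a particular residual standard deviation is needed as well, eps_perp can be rescaled before it is added, since rescaling preserves the orthogonality.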

— Roland