
I would like to simulate data for a logistic regression where I can specify its explained variance beforehand. Have a look at the code below. I simulate four independent variables and specify that each logit coefficient should be of size log(2) = 0.69. This works nicely: the explained variance (I report Cox & Snell's R2) is 0.34.

However, I need to specify the regression coefficients in such a way that a pre-specified R2 results from the regression. Say I would like to produce an R2 of exactly 0.1: how do the coefficients need to be specified? I am struggling with this.

# Create independent variables
sigma.1 <- matrix(c(1,0.25,0.25,0.25,   
                0.25,1,0.25,0.25,   
                0.25,0.25,1,0.25,    
                0.25,0.25,0.25,1),nrow=4,ncol=4)
mu.1 <- rep(0,4) 
n.obs <- 500000 

library(MASS)
sample1 <- as.data.frame(mvrnorm(n = n.obs, mu.1, sigma.1, empirical=FALSE))

# Create latent continuous response variable 
sample1$ystar <- 0 + log(2)*sample1$V1 + log(2)*sample1$V2 + log(2)*sample1$V3 + log(2)*sample1$V4

# Construct binary response variable
sample1$prob <- exp(sample1$ystar) / (1 + exp(sample1$ystar))
sample1$y <- rbinom(n.obs,size=1,prob=sample1$prob)

# Logistic regression
logreg <- glm(y ~ V1 + V2 + V3 + V4, data=sample1, family=binomial)
summary(logreg)

The output is:

Call:
glm(formula = y ~ V1 + V2 + V3 + V4, family = binomial, data = sample1)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.7536  -0.7795  -0.0755   0.7813   3.3382  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.002098   0.003544  -0.592    0.554    
V1           0.691034   0.004089 169.014   <2e-16 ***
V2           0.694052   0.004088 169.776   <2e-16 ***
V3           0.693222   0.004079 169.940   <2e-16 ***
V4           0.699091   0.004081 171.310   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 693146  on 499999  degrees of freedom
Residual deviance: 482506  on 499995  degrees of freedom
AIC: 482516

Number of Fisher Scoring iterations: 5

And Cox and Snell's r2 gives:

library(pscl)
pR2(logreg)["r2ML"]

 r2ML 
0.3436523 
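
As an aside, Cox & Snell's R2 can be reproduced directly from the null and residual deviances above, since R2 = 1 - (L_null / L_full)^(2/n) = 1 - exp(-(D_null - D_resid) / n). A quick consistency check (this snippet is my own addition):

# Cox & Snell's R2 recomputed from the model's deviances
d.null  <- logreg$null.deviance    # 693146 in the summary above
d.resid <- logreg$deviance         # 482506 in the summary above
1 - exp(-(d.null - d.resid) / n.obs)
# [1] 0.3436523  -- should agree with pR2(logreg)["r2ML"]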
Lion Behrens
  • why do you say `independent` variables? – RolandASc Mar 19 '18 at 10:35
  • The object sample1 consists of the four x-variables that are used as predictors of y in the regression. They are drawn from their population mean vector mu.1 and covariance matrix sigma.1 using the mvrnorm function. Does that help? – Lion Behrens Mar 19 '18 at 10:45
  • I think this question is seriously underspecified. I'm assuming your would not find an answer satisfying that suggested only correlation of Y to X1 with zero correlation of X2, X3, X4 to both Y and all the other predictors. – IRTFM Mar 19 '18 at 21:02

2 Answers


If you add a random error term to the ystar variable, making ystar.r, and then work with that, you can tweak the error's standard deviation until the R2 meets your specification (an automated search is sketched after the code below).

sample1$ystar.r <- sample1$ystar+rnorm(n.obs, 0, 3.8)  # tried a few values
sample1$prob <- exp(sample1$ystar.r) / (1 + exp(sample1$ystar.r))
sample1$y <- rbinom(n.obs,size=1,prob=sample1$prob)
logreg <- glm(y ~ V1 + V2 + V3 + V4, data=sample1, family=binomial)
summary(logreg)  # the estimates "shrink"
pR2(logreg)["r2ML"]
#-------
     r2ML 
0.1014792
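
Rather than trying standard deviations by hand, the search can be automated. Below is a minimal sketch using uniroot; the helper r2.for.sd and the bracketing interval c(0.1, 20) are my own choices, and since each evaluation re-simulates y, the root is only accurate up to simulation noise (small at n.obs = 500000):

library(pscl)

# Hypothetical helper: realized Cox & Snell R2 for a given noise SD
r2.for.sd <- function(s) {
  dat <- sample1[, c("V1", "V2", "V3", "V4")]
  dat$y <- rbinom(n.obs, size = 1,
                  prob = plogis(sample1$ystar + rnorm(n.obs, 0, s)))
  fit <- glm(y ~ V1 + V2 + V3 + V4, data = dat, family = binomial)
  as.numeric(pR2(fit)["r2ML"])
}

# R2 shrinks as the noise SD grows, so look for the crossing point
target <- 0.1
uniroot(function(s) r2.for.sd(s) - target, interval = c(0.1, 20))$root
# should land near the hand-tuned value of 3.8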
IRTFM

R-squared (and its variations) is a random variable, as it depends on your simulated data. If you simulate data with the exact same parameters multiple times, you'll most likely get different values for R-squared each time. Therefore, you cannot produce a simulation where the R-squared will be exactly 0.1 just by controlling the parameters.
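
To see this, re-run the simulation a few times with identical parameters; at a smaller sample size than the question's 500000 the spread becomes clearly visible. The helper below is my own illustration, reusing mu.1, sigma.1, and the log(2) coefficients from the question:

library(MASS)
library(pscl)

# Realized Cox & Snell R2 for one fresh simulation of size n
one.r2 <- function(n) {
  x <- as.data.frame(mvrnorm(n, mu.1, sigma.1))
  x$y <- rbinom(n, size = 1, prob = plogis(log(2) * rowSums(x[, 1:4])))
  fit <- glm(y ~ V1 + V2 + V3 + V4, data = x, family = binomial)
  as.numeric(pR2(fit)["r2ML"])
}

replicate(5, one.r2(n = 500))  # five different R2 values, same parameters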

On the other hand, since it's a random variable, you could potentially simulate your data from a conditional distribution (conditioning on a fixed value of R-squared), but you would need to find out what these distributions look like (the math might get really ugly here; Cross Validated is more appropriate for that part).
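
A brute-force version of that conditioning is rejection sampling: first center the simulation near the target (for instance with the noise trick from the other answer), then keep re-simulating until the realized R2 lands within a tolerance of the target. This sketch is my own; the tolerance of 0.005 and the noise SD of 3.8 are illustrative choices:

# Rejection sampling on the realized R2
sim.until <- function(target = 0.1, tol = 0.005, n = 500, noise.sd = 3.8) {
  repeat {
    x <- as.data.frame(mvrnorm(n, mu.1, sigma.1))
    ystar <- log(2) * rowSums(x[, 1:4]) + rnorm(n, 0, noise.sd)
    x$y <- rbinom(n, size = 1, prob = plogis(ystar))
    fit <- glm(y ~ V1 + V2 + V3 + V4, data = x, family = binomial)
    if (abs(as.numeric(pR2(fit)["r2ML"]) - target) < tol) return(x)
  }
}

dat <- sim.until()  # realized Cox & Snell R2 within 0.005 of the 0.1 target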

VFreguglia
  • Hi Freguglia, thanks for your answer! The latter part of your comment is exactly what I aim to do. Of course R2 is subject to sampling variation just as every parameter is, but there is certainly an expected or "true" value of R2 given the coefficients. Since R2 in logistic regression is a function of the full model's and the null model's likelihoods, I basically need to determine how the coefficients relate to these likelihood values, I think... – Lion Behrens Mar 15 '18 at 14:43