
I am doing some simulation work. I first used a logit model to get the probability of treatment for each observation, then used rbinom() to generate the binary treatment variable.

With the treatment variable generated, I used glm with a logit link to estimate the parameter gamma. It should be 1, but across multiple tries (even with the sample size increased) it stays around 0.3. Where does the bias come from?

Code is attached:

set.seed(99)
n = 10000
for (rv in c('X1','X2', 'Z1', 'Z2','e','u')){
  assign(rv, rnorm(n =n, mean = 0, sd =5))
  # check values
  # get(rv), eval(as.name/symbol(rv))
}
X = cbind(X1,X2)
Z = cbind(Z1,Z2)
gamma = c(1,1)
# treatment probability for each observation
p_treatment = 1/(1+exp(-(X%*%gamma+e)))
# track treated or not
treated = mapply(FUN = rbinom, prob = p_treatment, size = 1, n = 1)
beta = c(1,1)
y = 1 + X%*%beta+treated+u
fit_lgt = glm(treated ~ X, family = binomial(link = 'logit'))
summary(fit_lgt)
  • If you need help with model fitting, you should ask the statisticians over at [stats.se]. This really isn't a specific programming question that's appropriate for Stack Overflow. – MrFlick Feb 20 '19 at 19:30
  • Thanks for the comment. I was suspecting either I wrote the wrong code or I adopted the wrong model. – caden Hong Feb 20 '19 at 19:31

2 Answers


The logistic model does not have an error term e in it, so p_treatment should be calculated as:

p_treatment = 1/(1+exp(-(X%*%gamma)))

This gets you the correct estimates:

            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.01441    0.04304   0.335    0.738    
XX1          1.03875    0.02643  39.297   <2e-16 ***
XX2          1.00852    0.02589  38.951   <2e-16 ***
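For intuition on why the biased fit lands near 0.3 rather than 1: adding N(0, 5²) noise inside the link is roughly equivalent to rescaling the latent logistic index, whose own error has variance π²/3 ≈ 3.29. The back-of-the-envelope attenuation factor below is an approximation I'm adding for illustration (it is exact for a probit with normal noise, only approximate for the logit), not part of the original answer:

```python
import math

sigma_e2 = 25.0              # variance of e ~ N(0, 5^2) from the question's code
latent_var = math.pi**2 / 3  # variance of the standard logistic latent error
# approximate multiplicative shrinkage of the fitted coefficient
attenuation = 1 / math.sqrt(1 + sigma_e2 / latent_var)
print(round(attenuation, 2))  # ~0.34, matching the ~0.3 the asker observed
```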
Andrey Shabalin
  • Hi, thanks for the suggestion. It would be correct without error term, but I was hoping it would be unbiased even under some mean 0 error, just like the case in linear regression... – caden Hong Feb 20 '19 at 19:40
  • @cadenHong Once again - no. It won't be unbiased if you're applying an error before applying the link. – Dason Feb 20 '19 at 19:40
  • @Andrey Shabalin Advice taken. Upvoted. Thanks for your input very much. The bias mainly comes from the error on probability since, as you point out, eliminating the error term would correct the bias. – caden Hong Feb 20 '19 at 20:12

This isn't a programming question as much as a question about understanding your model. I don't particularly like how you coded the simulation, but that isn't what I'll address here.

In a generalized linear model you don't add random noise before applying the link. The line that is throwing things off is:

p_treatment = 1/(1+exp(-(X%*%gamma+e)))

You shouldn't be adding extra error so you should change this to:

p_treatment = 1/(1+exp(-(X%*%gamma)))
Dason
  • Hi, appreciate your answer~ However, most generalized linear models should give the right estimate as long as the error has mean 0 and independent. I will check with CrossValidated if this is not the coding problem. – caden Hong Feb 20 '19 at 19:35
  • @cadenHong You are mistaken. If you're adding error before applying the link then all bets are off. In a generalized linear model you never "add an error term". You need to change the way you're thinking about the model when using a glm as compared to a linear model. – Dason Feb 20 '19 at 19:39
  • @cadenHong As a side note... shouldn't the fact that you were getting 'bad' estimates when adding the error before applying the link at least partially convince you that you might be wrong on this? – Dason Feb 20 '19 at 19:44
  • The randomness is introduced at the data generation step by the `rbinom` function. The binomial values are random. – Andrey Shabalin Feb 20 '19 at 19:44
  • The error term is added as required in my assignment, and it makes sense in econometric simulation since there would always be noise/unobservables. Thanks for the comments; it's true that I should look at the glm mechanism more carefully. – caden Hong Feb 20 '19 at 19:48
  • My advice to @cadenHong : study the model, understand it, understand where the randomness come from, upvote the useful answers, toss a coin and accept one of them. – Andrey Shabalin Feb 20 '19 at 19:50
  • The 'noise' comes from the rbinom function. Think of simulating a situation where I'm going to throw some darts and record if I get a bullseye. That's already random with a certain probability of success. If our goal is to estimate that probability of success we wouldn't also put me on a table and shake the table while I'm throwing. That would add noise before even applying the event and we wouldn't really be estimating the probability we want to estimate anymore. – Dason Feb 20 '19 at 19:50
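Dason's point can be checked directly by simulation. Below is a self-contained sketch (in Python/NumPy rather than the thread's R, with a hand-rolled Newton-Raphson logit fit so no modeling library is assumed) that generates the treatment both ways and compares the fitted slopes. The `fit_logit` helper is mine, not from the thread:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logit(X, y, iters=25):
    """Logistic regression MLE via Newton-Raphson (no regularization)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ beta)
        W = p * (1 - p)                      # IRLS weights
        H = X.T @ (X * W[:, None])           # observed information
        g = X.T @ (y - p)                    # score
        beta += np.linalg.solve(H, g)
    return beta

rng = np.random.default_rng(99)
n = 10000
X = rng.normal(0, 5, size=(n, 2))
e = rng.normal(0, 5, size=n)
gamma = np.array([1.0, 1.0])

# Correct DGP: the only randomness is the Bernoulli draw itself
p_clean = sigmoid(X @ gamma)
t_clean = rng.binomial(1, p_clean)

# Question's DGP: extra noise injected inside the link
p_noisy = sigmoid(X @ gamma + e)
t_noisy = rng.binomial(1, p_noisy)

Xd = np.column_stack([np.ones(n), X])       # add an intercept column
b_clean = fit_logit(Xd, t_clean)
b_noisy = fit_logit(Xd, t_noisy)
print(b_clean[1:])  # slopes close to (1, 1)
print(b_noisy[1:])  # slopes attenuated toward roughly 0.3
```

The clean fit recovers gamma = (1, 1) up to sampling error, while the noisy fit reproduces the ~0.3 estimates from the question, which is exactly the distinction Dason's dart analogy describes.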