I am trying to understand the syntax of the "ivprobit" function in "ivprobit" package in R. The instruction says:

 Usage
 ivprobit(formula, data)

 Arguments
    formula y~x|y1|x2 where y is the dichotomous l.h.s., x is the r.h.s.
            exogenous variables, y1 is the r.h.s. endogenous variables and
            x2 is the complete set of instruments
    data    the dataframe

Then it shows the corresponding example:

 data(eco)

 pro<-ivprobit(d2~ltass+roe+div|eqrat+bonus|ltass+roe+div+gap+cfa,eco)

 summary(pro)

If I match with the instruction's explanation,

 y= d2 = dichotomous l.h.s.
 x= ltass+roe+div = the r.h.s. exogenous variables
 y1= eqrat+bonus = the r.h.s. endogenous variables
 x2= ltass+roe+div+gap+cfa = the complete set of instruments

I do not understand the difference between x and x2. Also, if x2 is the complete set of instruments, why doesn't it include the endogenous variables y1 as well? It instead additionally includes "gap" and "cfa" variables which are not even shown in x (exogenous variables) or even in y either.

If, let's say, my chosen instrumental variables are indeed "eqrat" and "bonus", how should I construct the formula, given the difference between x (exogenous variables) and x2 (the complete set of instruments)?

Eric

1 Answer

Note that here we are discussing syntax, not the "goodness" of the model. For that kind of question you should refer to https://stats.stackexchange.com/.

Let's use a regression equation as an example: [equation image not available].

As correctly pointed out in the comments, the instruments Z are not really in the equation; it's just an example.

Here:

  • y is the dependent variable;
  • the endogenous variables (one or more) are the "problematic" regressors;
  • the exogenous variables (one or more) are the regressors that are not "problematic";
  • Z are the instruments (one or more), which "help" with the endogenous variables.

Why are the endogenous variables problematic? Because they are correlated with the error term, which causes problems with classic OLS estimation.

The instruments Z qualify as instruments because they have some fundamental properties (more here):

  • they are independent of the error term;
  • they do not affect the dependent variable once the endogenous variables are held constant;
  • they are correlated with the endogenous variables.

In the syntax proposed, we have:

  • x, the exogenous variables (not problematic);
  • y1, the endogenous variables (problematic);
  • x2, the complete set of instruments: the exogenous variables in x plus the additional instruments Z.

In the example you cite, x2 shares some common variables with x, which is the set of exogenous variables (not problematic), plus two more instruments.

The model is using the 3 exogenous variables as instruments, plus two more variables.
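To make the overlap concrete, here is a minimal base-R sketch that splits the documentation's example formula on `|` and compares the three parts. It does not call ivprobit itself; the names `exog`, `endog`, and `instr` are illustrative and not part of the package.

```r
# Formula from the ivprobit documentation example: y ~ x | y1 | x2
f <- d2 ~ ltass + roe + div | eqrat + bonus | ltass + roe + div + gap + cfa

rhs <- f[[3]]                      # everything to the right of `~`
# `|` is left-associative, so rhs is (x | y1) | x2
exog  <- all.vars(rhs[[2]][[2]])   # x:  exogenous regressors
endog <- all.vars(rhs[[2]][[3]])   # y1: endogenous regressors
instr <- all.vars(rhs[[3]])        # x2: complete set of instruments

all(exog %in% instr)      # TRUE: every exogenous variable is repeated in x2
intersect(endog, instr)   # empty: no endogenous variable appears in x2
setdiff(instr, exog)      # "gap" "cfa": the additional instruments
```

The same check can be run on your own formula before fitting, to confirm that x2 repeats all of x and adds at least one extra variable.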

I do not understand the difference between x and x2

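x2 is the complete set of instruments: it must list every exogenous variable in x plus the additional instruments (here gap and cfa). As noted in the comments, the model will not run if the full set is not written out.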

if x2 is the complete set of instruments, why doesn't it include the endogenous variables y1 as well?

It mustn't include the endogenous variables, because those are the ones that the equation needs to take care of, using the instruments.


An example:

Suppose you want to build a model that predicts whether a woman in a two-parent household is employed. You have these variables:

  • fem_works, the response or dependent variable;
  • fem_edu, the education level of the woman, exogenous;
  • kids, number of kids of the couple, exogenous;
  • other_income, the income of the household, endogenous (you know this as prior knowledge);
  • male_edu, the education level of the man, instrument (you choose this).

With ivprobit, this would be:

mod <- ivprobit(fem_works ~ fem_edu + kids | other_income | fem_edu + kids + male_edu, data)

other_income is problematic for the model because you suspect that it is correlated with the error term (other shocks may affect both fem_works and other_income). You therefore decide to use male_edu as an instrument in order to "alleviate" that problem. (Example taken from here)
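For intuition about what the instrument does, here is a rough base-R sketch of a two-step control-function probit on simulated data. This is an assumption-laden illustration, not necessarily the estimator that ivprobit uses internally: the data, the coefficients, and the names `dat`, `stage1`, `stage2`, and `v_hat` are all made up for the example.

```r
set.seed(1)
n <- 2000

# Simulated data; every coefficient below is invented for illustration.
fem_edu  <- rnorm(n)
kids     <- rpois(n, 1.5)
male_edu <- rnorm(n)
u        <- rnorm(n)   # unobserved shock shared by income and employment

# other_income depends on the instrument (male_edu) and on the shock u.
other_income <- 0.5 * fem_edu + 0.8 * male_edu + u + rnorm(n)

# The outcome also depends on u, which is what makes other_income endogenous.
# male_edu affects the outcome only through other_income.
latent    <- 0.6 * fem_edu - 0.4 * kids - 0.5 * other_income + u + rnorm(n)
fem_works <- as.integer(latent > 0)

dat <- data.frame(fem_works, fem_edu, kids, other_income, male_edu)

# Step 1: regress the endogenous variable on the complete set of instruments
# (all exogenous variables plus the extra instrument).
stage1 <- lm(other_income ~ fem_edu + kids + male_edu, data = dat)
dat$v_hat <- resid(stage1)

# Step 2: probit for the outcome, adding the first-step residual as a control.
stage2 <- glm(fem_works ~ fem_edu + kids + other_income + v_hat,
              family = binomial(link = "probit"), data = dat)

names(coef(stage1))   # male_edu appears here
names(coef(stage2))   # male_edu does not appear here
```

In this sketch male_edu has a coefficient only in the first step, which is consistent with the remark in the comments that the outcome equation reports no coefficient for the instrument.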

RLave
  • This doesn't seem quite right. For one thing, the instruments `Z` should not appear in the equation for `y`; indeed, if they do (with non-zero coefficients) then they are invalid instruments, as they violate the exclusion restriction! Plus, it's not really true to say that the instruments may or may not overlap with the endogenous variables: the full set of instruments includes all the exogenous covariates plus additional instrument(s). It's unclear from the `ivprobit` documentation whether this needs to be fully written out or whether the design matrix is properly constructed automatically. – gfgm Feb 25 '19 at 12:08
  • Yes, `Z` is not in the equation, I agree; I should rephrase that part. And I agree that the docs are not clear; I just followed the example to answer OP's question, which was syntax-related. My answer was aimed at OP's request for more explanation of the syntax, not at what a `probit` model is. – RLave Feb 25 '19 at 12:38
  • Or even what makes a good `instrument`; those questions are not even for this site. Of course, thank you for pointing out some imprecisions. – RLave Feb 25 '19 at 12:47
  • But even syntactically this is not correct: if the instrumental variable is `male_edu` then the complete set of instruments is `fem_edu + kids + male_edu`. In fact, if you try `ivprobit` with the example from the documentation (or any arbitrary example) the model will not run if you do not write out the full set of instruments. – gfgm Feb 26 '19 at 11:41
  • My mistake, thank you for pointing that out. I added that part later and didn't check it against the example above. – RLave Feb 26 '19 at 12:58
  • @RLave: Thank you for your answer and sorry that I did not have a chance to make additional comments. I have one question to ask. If I run this model following the required syntax, I get the binary regression results with coefficients excluding the instrumental variable. Is it normal to get the results without the instrumental variable coefficient? If I need the result with the instrumental variable coefficient, do I simply run the additional normal binary regression including the instrumental variable? – Eric Mar 09 '19 at 21:29
  • @RLave: If so, is the point of using this ivprobit to simply see whether the instrumental variable has valid effect while the result (which does not include the instrumental variable coefficient) is not practical to use? I would appreciate if I can get the answers to these questions please. Thank you! – Eric Mar 09 '19 at 21:29
  • I would appreciate if anyone can answer the additional two small questions above if you don't mind. Thank you! – Eric Mar 10 '19 at 20:20
  • I think that if you carefully read the example reported here (https://www.stata.com/manuals13/rivprobit.pdf) you can find some clarification on your output. It seems normal that you don't get the coefficient for the instrument. Stata shows a test of the "goodness" of the instrument (Wald test). – RLave Mar 11 '19 at 07:40
  • I can suggest a similar function in another package, https://www.rdocumentation.org/packages/AER/versions/1.2-6/topics/ivreg, with maybe more documentation. – RLave Mar 11 '19 at 07:44
  • If this doesn't make things clear for you, I suggest you ask a detailed question over at https://stats.stackexchange.com/. First try and look for similar questions on "ivprobit" there, then ask if it's still not crystal clear. :) – RLave Mar 11 '19 at 07:47
  • @RLave: Thank you for your response. So do you mean not including the instrumental variable in the reporting regression output is a general practice? If so, how can I show that the result considers the instrumental variable effect in the output? Do I simply explain it by words? If so, how can I do so since the instrumental variable has no coefficients to report? – Eric Mar 11 '19 at 20:26
  • I'm not sure about these questions. Again, I think you should ask over at stats.stackexchange; they will be able to answer you better. – RLave Mar 12 '19 at 07:45
  • @RLave: Thank you for your messages. Then if I ask a different one, may I know how I can derive the goodness-of-fits for "ivprobit" outcomes such as chi-square and R square? Previously, when I was using a simple logit model, I used to use the log-likelihood values to derive the chi-square and some deviance measures to get the R square values. However, using "ivprobit" this does not work I'm afraid. Is there an alternative way to do so? – Eric Mar 13 '19 at 22:50
  • From the looks of it, `ivprobit` doesn't give you those values the way, for example, `ivreg` does (https://www.rdocumentation.org/packages/AER/versions/1.2-6/topics/ivreg). So I'm afraid you'd have to wait until the author implements them (this seems to be a fairly recently released package, after all). – RLave Mar 14 '19 at 07:30
  • See if you can find another package for ivprobit; I don't know whether `ivreg` supports a binary response. The alternative is to code the functions for the metrics yourself. – RLave Mar 14 '19 at 07:33
  • @RLave: Thank you so much again for your detailed response. How about using the "naivereg" package and setting "binomial", which is one of the choices this function offers? Will this work for my purpose as well? – Eric Mar 14 '19 at 08:41
  • This seems a good solution; it's well documented here: https://www.rdocumentation.org/packages/naivereg/versions/1.0.1/topics/naivereg – RLave Mar 14 '19 at 08:49
  • @RLave: Thank you again for your concerns, I checked this "naivereg" again but it seems like the "binary" option was only imposed for the endogenous variable not my response variable Y which is different from my purpose (double checked it from the code author). Maybe I will have to focus on using ivprobit to resolve the issue completely. Thank you very much again. – Eric Mar 14 '19 at 14:18
  • @RLave: I find the coefficients’ significances drop as I use the ivprobit model considering the instrumental variable compared to the simple logit model using glm without considering instrumental variable. However I find the desired sign of my endogenous variable’s coefficient using ivprobit model. Is this normal? Is there a way to increase the significances of the coefficients which are dropped by changing my regression from simple logit model to ivprobit model? – Eric Mar 14 '19 at 18:20
  • You should never adapt your model to your theories. I really think that you need to study more what these models are, and what they do. And this is not the place. Cheers. – RLave Mar 15 '19 at 08:23
  • @RLave: Sorry but I know what the models are. I was trying to find if there are any nowadays technical aspect I could have missed to use to get closer to my objective in addition to the economical aspect which is a common approach. I'm sorry to hear you do not know of any. – Eric Mar 15 '19 at 09:51
  • Without knowing more about your data, one can only do so much. If you have any questions regarding the results you should ask them over at CrossValidated, not here, because here we discuss problems with programming. Post the output from the models and the questions you have over there. – RLave Mar 15 '19 at 09:55
  • @RLave: Is there a reason why R does not have logit model using instrumental variable but only has probit model using instrumental variable? Or does it have logit model with instrumental variable that I wasn't aware of? – Eric Mar 22 '19 at 12:49
  • Hi. Try not to use comments to pose different questions; this is not a chat room. If you have any new questions, ask them separately. Also, as I said before, https://stats.stackexchange.com/ is more suited to this kind of question; here they are not exactly on topic. There you will also find better answerers. – RLave Mar 29 '19 at 07:33