
I came across a study in which the authors hypothesized that x causes y. The team used self-assessment survey questions to collect all the data on x and y. At time point 1, they asked respondents questions measuring both x and y (x1 and y1). At time point 2, they recontacted the same group of respondents but only asked the questions measuring y (y2).

I think that, in theory, it is possible that y could cause x, which would make x an endogenous regressor in these models.

To test their hypothesis, the authors constructed three regression models:

model 1: y1 = a*x1 + e (a is positive and statistically significant)

model 2: y2 = b*x1 + f (b is positive and statistically significant)

model 3: y2 = c*x1 + d*y1 + g (both c and d are positive and statistically significant, and c is smaller than a and b)

The authors deemed model 3 an autoregressive (AR) model and argued that since the coefficient of x1 in model 3 (c) is still positive and significant with y1 controlled for, the effect of x on y is robust, and that this helps them establish the causal order.
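To make the setup concrete, here is a minimal sketch of the three models fit on simulated data (the data-generating process, effect sizes, and variable names are my own assumptions for illustration, not the study's):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 500

    # Simulate data under the authors' hypothesis: x drives y at both waves.
    x1 = rng.normal(size=n)
    y1 = 0.5 * x1 + rng.normal(size=n)               # x1 -> y1
    y2 = 0.3 * x1 + 0.4 * y1 + rng.normal(size=n)    # x1 and y1 -> y2

    X1 = sm.add_constant(x1)

    m1 = sm.OLS(y1, X1).fit()                                          # model 1: y1 ~ x1
    m2 = sm.OLS(y2, X1).fit()                                          # model 2: y2 ~ x1
    m3 = sm.OLS(y2, sm.add_constant(np.column_stack([x1, y1]))).fit()  # model 3: y2 ~ x1 + y1

    for m in (m1, m2, m3):
        print(m.params, m.pvalues)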

My questions are:

1) Can the AR model (model 3), along with the two simple OLS models (models 1 & 2), address the concern of reverse causality?

2) Had the authors collected x at time point 2 (x2) and run the same three models with x and y's positions switched (x as DV and y as IV), would it be mathematically possible to find that all coefficients of y are positive and statistically significant, which could then be used to support the reversed causal story: y causes x?

3) In models 1 and 2, they found x1 has a significant effect on both y1 and y2. Would using y1 and x1 together as predictors in model 3 leave room for biased estimates, given the authors' theoretical model?

4) I know that finding an instrumental variable for a two-stage least squares (2SLS) model is a rather common practice to address endogeneity concerns; can a lagged time-series variable do the same trick? I have found some "ad hoc" solutions using a lagged variable to deal with this problem, but the models were usually specified as:

     y2 = a*x1 + e
     y2 = b*x2 + f 

Did I get it wrong?
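For concreteness, here is how I picture that lagged-instrument idea, as a hand-rolled 2SLS on simulated data (x2, the confounder u, and all effect sizes are made up for illustration; this assumes x had also been measured at time 2):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 500

    # Hypothetical data: x persists over time, and x2 is endogenous in
    # y2's equation because of the unobserved confounder u.
    x1 = rng.normal(size=n)
    u = rng.normal(size=n)
    x2 = 0.7 * x1 + 0.5 * u + rng.normal(size=n)
    y2 = 0.4 * x2 + 0.5 * u + rng.normal(size=n)

    # Stage 1: project the endogenous regressor x2 on the lagged instrument x1.
    stage1 = sm.OLS(x2, sm.add_constant(x1)).fit()
    x2_hat = stage1.fittedvalues

    # Stage 2: regress y2 on the stage-1 fitted values. Naive OLS of y2 on x2
    # would be biased upward here; the 2SLS slope should be near 0.4.
    # (The second-stage standard errors from this manual version are not the
    # correct 2SLS standard errors; this is only to show the mechanics.)
    stage2 = sm.OLS(y2, sm.add_constant(x2_hat)).fit()
    print(stage2.params)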

Thank you in advance for any insights here.


1 Answer


If I understand your description correctly, the paper makes the following assumptions (hypotheses):

  • X_i causes Y_i
  • The effect of X_i -> Y_i is linear
  • X_i also causes Y_{i+1}

Regarding the OLS: A linear model only captures linear correlations. If Y_i causes X_i, you'll still get positive and statistically significant coefficients.
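A quick simulation sketch illustrates this: the data below are generated with y causing x, yet regressing y on x still produces a positive, highly significant slope (the effect size and variable names are assumed):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    n = 500

    # Data generated with the REVERSE direction: y causes x.
    y = rng.normal(size=n)
    x = 0.6 * y + rng.normal(size=n)

    # Regressing y on x nevertheless yields a positive, significant slope.
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    print(fit.params[1], fit.pvalues[1])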

Regarding the AR: if X_i causes (or at least correlates linearly with) Y_i, and Y_i causes (or at least correlates linearly with) Y_{i+1}, then X_i indirectly causes (and correlates with) Y_{i+1}. Adjusting for Y_i would mainly estimate the direct effect of X_i on Y_{i+1}, i.e., what remains beyond the linear effect of Y_i on Y_{i+1}.
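As a sketch, consider an assumed data-generating process in which x1 affects y2 only through y1; adjusting for y1 then pushes x1's coefficient toward zero, showing that model 3 isolates the direct effect:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n = 2000

    # Assumed process: x1's effect on y2 is ENTIRELY indirect, via y1.
    x1 = rng.normal(size=n)
    y1 = 0.6 * x1 + rng.normal(size=n)
    y2 = 0.7 * y1 + rng.normal(size=n)   # no direct x1 term

    without = sm.OLS(y2, sm.add_constant(x1)).fit()
    with_y1 = sm.OLS(y2, sm.add_constant(np.column_stack([x1, y1]))).fit()

    print(without.params[1])   # picks up the indirect path, clearly positive
    print(with_y1.params[1])   # near zero once y1 is adjusted for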

But the argumentation is not complete. X could cause Y, or Y could cause X. Being able to predict Y given X is consistent with X causing Y. But it doesn't disprove that Y causes X, and thus doesn't prove that X causes Y.

In logical reasoning, we have the statements A = "X causes Y" and B = "a correlation exists between X and Y", where we know that A => B. Observing B (being able to fit the above models) then does not imply A: if A => B, knowing B alone gives no information about A. However, if not B, you do know for sure that A is not true.
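In symbols (the first inference, affirming the consequent, is invalid; the second, modus tollens, is valid):

$$(A \Rightarrow B) \land B \;\not\Rightarrow\; A$$

$$(A \Rightarrow B) \land \lnot B \;\Rightarrow\; \lnot A$$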

Therefore, it seems they had better also assume "Y causes X" and run the same analysis in that direction. If those estimated effects are not significant, that contradicts "Y causes X" and thus supports "Y does not cause X" (at least not via a linear effect).
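A sketch of that reversed check (simulated so that the truth is "x causes y"; it assumes a second wave x2 was collected, which the actual study did not do):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    n = 2000

    # Truth in this simulation: x causes y, and x evolves on its own.
    x1 = rng.normal(size=n)
    y1 = 0.5 * x1 + rng.normal(size=n)
    x2 = 0.7 * x1 + rng.normal(size=n)   # hypothetical second wave of x

    # Reversed AR model: x2 ~ y1 + x1. Under "y does not cause x",
    # y1's coefficient should be indistinguishable from zero.
    rev = sm.OLS(x2, sm.add_constant(np.column_stack([y1, x1]))).fit()
    print(rev.params[1], rev.pvalues[1])   # y1's slope: near zero, not significant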

If the observations match with X causing Y, and do not match with Y causing X, you could state that X causing Y is more plausible than Y causing X (under the assumption that the effects are constant over time).
