Equation-by-equation instrumental variable regression (control function) in R

Question

I want to do a equation-by-equation instrumental variable (IV) regression with a control function in R (using tidyverse and broom). I want to implement this based on a grouped data frame with a dependent variable, y, an endogenous variable, x, an instrument for this endogenous variable, z1, and an exogeneous variable, z2. Following a Two Stage Least Squares (2SLS) approach, I would run: (1) Regress x on z1 and z2 and (2) Regress y on x, z2 and v(the residuals from (1)). For more details for this approach see: https://www.irp.wisc.edu/newsevents/workshops/appliedmicroeconometrics/participants/slides/Slides_14.pdf. Unfortunately, I am not able to run the second regression without an error (see below).

My data looks like this:

df <- data.frame(
  id = sort(rep(seq(1, 20, 1), 5)),
  group = rep(seq(1, 4, 1), 25),
  y = runif(100),
  x = runif(100),
  z1 = runif(100),
  z2 = runif(100)
)

where id is an identifier for the observations, group is an identifier for the groups and the rest is defined above.

library(tidyverse)
library(broom)

# Nest the data frame
df_nested <- df %>% 
  group_by(group) %>% 
  nest()

# Run first stage regression and retrieve residuals
df_fit <- df_nested %>% 
  mutate(
    fit1 = map(data, ~ lm(x ~ z1 + z2, data = .x)),
    resids = map(fit1, residuals) 
  )

Now, I want to run the second stage regression. I've tried two things.

First:

df_fit %>% 
  group_by(group) %>% 
  unnest(c(data, resids)) %>% 
  do(lm(y ~ x + z2, data = .x))

This produces Error in is.data.frame(data) : object '.x' not found.

Second:

df_fit %>% 
  mutate(
    fit2 = map2(data, resids, ~ lm(y ~ x + z2, data = .x))
  )
  
df_fit %>% unnest(fit2)

This produces: Error: Must subset columns with a valid subscript vector. x Subscript has the wrong type `grouped_df< . If you would work with a larger data set, the second approach would even run into storage problems.

How is this done correctly?

I reformulated the question above in a more general style (with a focus on the inclusion of elements from previous regressions in the final regression). You can find it here: https://stackoverflow.com/questions/70287136/multi-step-regression-with-broom-and-dplyr-in-r. — timm, Dec 10 '21 at 08:00

score 2 · Accepted Answer · answered Dec 06 '21 at 17:06

The broom package is loaded but there was no tidy applied to the lm output. In addition, the OP's code had some typos i.e. after mutateing to create the fit2, the object 'df_fit' was not updated (<-), thus df_fit %>% unnest(fit2) wouldn't work as the column is not found

library(dplyr)
library(purrr)
library(broom)
library(tidyr)
df_fit %>%
    ungroup %>% 
    mutate(
    fit2 = map2(data, resids, ~ tidy(lm(y ~ x + z2, data = .x))
  )) %>% 
   unnest(fit2)

-output

# A tibble: 12 × 9
   group data              fit1   resids     term        estimate std.error statistic  p.value
   <dbl> <list>            <list> <list>     <chr>          <dbl>     <dbl>     <dbl>    <dbl>
 1     1 <tibble [25 × 5]> <lm>   <dbl [25]> (Intercept)  0.357       0.126    2.82   0.00987 
 2     1 <tibble [25 × 5]> <lm>   <dbl [25]> x           -0.0290      0.173   -0.168  0.868   
 3     1 <tibble [25 × 5]> <lm>   <dbl [25]> z2           0.204       0.183    1.11   0.278   
 4     2 <tibble [25 × 5]> <lm>   <dbl [25]> (Intercept)  0.470       0.139    3.38   0.00272 
 5     2 <tibble [25 × 5]> <lm>   <dbl [25]> x            0.168       0.206    0.816  0.423   
 6     2 <tibble [25 × 5]> <lm>   <dbl [25]> z2           0.00615     0.176    0.0350 0.972   
 7     3 <tibble [25 × 5]> <lm>   <dbl [25]> (Intercept)  0.625       0.147    4.25   0.000325
 8     3 <tibble [25 × 5]> <lm>   <dbl [25]> x            0.209       0.255    0.818  0.422   
 9     3 <tibble [25 × 5]> <lm>   <dbl [25]> z2          -0.398       0.183   -2.18   0.0406  
10     4 <tibble [25 × 5]> <lm>   <dbl [25]> (Intercept)  0.511       0.235    2.17   0.0407  
11     4 <tibble [25 × 5]> <lm>   <dbl [25]> x            0.0468      0.247    0.189  0.851   
12     4 <tibble [25 × 5]> <lm>   <dbl [25]> z2          -0.0246      0.271   -0.0908 0.929

I have a follow-up question: should't you include .y["resids"] into fit2 = map2(data, resids, ~ tidy(lm(y ~ x + z2 + .y["resids"], data = .x)) so that the residual vector is considered in the regression formula? — timm, Dec 07 '21 at 10:40
@timm You may have to construct the formula with `paste` or use `reformulate`. Can you please ask as a new question — akrun, Dec 07 '21 at 21:14

Equation-by-equation instrumental variable regression (control function) in R

1 Answers1