I want to run an equation-by-equation instrumental variable (IV) regression with a control function in R (using tidyverse and broom). I want to implement this on a grouped data frame with a dependent variable, y, an endogenous variable, x, an instrument for this endogenous variable, z1, and an exogenous variable, z2. Following a two-stage least squares (2SLS) approach, I would (1) regress x on z1 and z2, and (2) regress y on x, z2 and v (the residuals from (1)). For more details on this approach, see: https://www.irp.wisc.edu/newsevents/workshops/appliedmicroeconometrics/participants/slides/Slides_14.pdf. Unfortunately, I am not able to run the second regression without an error (see below).
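To make the two steps concrete, here is a minimal sketch on a single, ungrouped data frame. The data frame `toy`, its coefficients, and the names `stage1` and `stage2` are made up purely for illustration; they are not part of my actual data.

```r
# Toy data, made up purely for illustration (not the data from this question)
set.seed(1)
n   <- 200
toy <- data.frame(z1 = runif(n), z2 = runif(n))
u   <- rnorm(n)                                   # unobserved confounder
toy$x <- 0.5 * toy$z1 + 0.3 * toy$z2 + u          # endogenous regressor
toy$y <- 1 + 2 * toy$x + 0.5 * toy$z2 + u + rnorm(n)

# Step (1): first stage, keep the residuals v
stage1 <- lm(x ~ z1 + z2, data = toy)
toy$v  <- residuals(stage1)

# Step (2): second stage, with the first-stage residuals as control function
stage2 <- lm(y ~ x + z2 + v, data = toy)
coef(stage2)
```

In this linear setting the coefficient on x from step (2) coincides with the one obtained by regressing y on the first-stage fitted values and z2. Note that the standard errors printed by the second `lm()` do not account for v being an estimated quantity.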
My data looks like this:
df <- data.frame(
  id = sort(rep(seq(1, 20, 1), 5)),
  group = rep(seq(1, 4, 1), 25),
  y = runif(100),
  x = runif(100),
  z1 = runif(100),
  z2 = runif(100)
)
where id is an identifier for the observations, group is an identifier for the groups, and the remaining variables are as defined above.
library(tidyverse)
library(broom)
# Nest the data frame
df_nested <- df %>%
  group_by(group) %>%
  nest()
# Run first stage regression and retrieve residuals
df_fit <- df_nested %>%
  mutate(
    fit1 = map(data, ~ lm(x ~ z1 + z2, data = .x)),
    resids = map(fit1, residuals)
  )
Now, I want to run the second stage regression. I've tried two things.
First:
df_fit %>%
  group_by(group) %>%
  unnest(c(data, resids)) %>%
  do(lm(y ~ x + z2, data = .x))
This produces: Error in is.data.frame(data) : object '.x' not found.
Second:
df_fit %>%
  mutate(
    fit2 = map2(data, resids, ~ lm(y ~ x + z2, data = .x))
  )
df_fit %>% unnest(fit2)
This produces: Error: Must subset columns with a valid subscript vector. x Subscript has the wrong type `grouped_df<. With a larger data set, the second approach would also run into storage problems.
How is this done correctly?
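One possible direction, offered as a sketch under assumptions rather than a definitive answer: run both stages inside a single helper that is mapped over the nested data and returns only the tidied second-stage coefficients, so that no model objects are kept in the data frame. The helper name `fit_cf` and the `set.seed()` call are hypothetical additions for illustration; the data frame is the one defined above.

```r
library(tidyverse)
library(broom)

set.seed(1)  # added only so the sketch is reproducible
df <- data.frame(
  id = sort(rep(seq(1, 20, 1), 5)),
  group = rep(seq(1, 4, 1), 25),
  y = runif(100), x = runif(100), z1 = runif(100), z2 = runif(100)
)

# Hypothetical helper: both stages for one group's data frame
fit_cf <- function(d) {
  # Stage 1: regress x on z1 and z2, attach the residuals as column v
  d$v <- residuals(lm(x ~ z1 + z2, data = d))
  # Stage 2: regress y on x, z2 and v; return only the coefficient table
  tidy(lm(y ~ x + z2 + v, data = d))
}

second_stage <- df %>%
  group_by(group) %>%
  nest() %>%
  mutate(coefs = map(data, fit_cf)) %>%
  ungroup() %>%
  select(group, coefs) %>%
  unnest(coefs)

second_stage
```

Unlike the two attempts above, the second-stage formula here includes v, and the residuals are attached to each group's data before the second `lm()` call instead of being passed as a separate list column.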