1

I have a problem using the kclass() function of the RCompAngrist package when i have missing values in my df. It's a function that is supposed to calculate a "limited information maximum likelihood" estimator with a dependent variable on the left hand side and two parts on the right hand side of the equation. The first one for endogenous variables and the second for instruments. It is based on the ivreg() function of the AER package. Below is a minimum working example which will reproduce the error.

library(magrittr)
library(devtools)
install_github(repo = "RCompAngrist",
                       username = "MatthieuStigler",
                       subdir = "RcompAngrist")
library(RcompAngrist)
a <- runif(10, 5, 90)
b <- runif(10, 4, 10)
c <- runif(10, 0, 1)
d <- runif(10, 5, 65)
e <- runif(10, 1, 2)
f <- runif(10, 1, 100)
g <- runif(10, 80, 90)
h <- c(1,12,3,5,NA,16,17,NA,9,10)
dummy <- kclass(a ~ b + c + d | d + e + f + g + h,
model = T,
data=df)

If you run this code you should get this error message from R:

Error in cbind(x_exo, z, x_endo, y) : number of rows of matrices must match (see arg 2)

It has to do with the NAs in the data frame but i can't figur out what exactly is going wrong. It works if you create the variable "h" without the NAs. However, if you omit the NAs via

df <- data.frame(a,b,c,d,e,f,g,h) %>% na.omit()

before you reestimate the model, R gives me this error message:

Error in R_Z[c(n_G, n_y), c(n_G, n_y)] : subscript out of bounds

I also don't understand why it doesn't omit the NAs by itself since the global option for na.action is na.omit. It gets even weirder though. If you delete "data=df" from the function and then rerun the model the error message switches back to

Error in cbind [...]

Why does it make any difference here if "data=df" is in the code or not? Does anyone have any ideas where the problem could lie? I don't get at all what's going wrong here.

avocado1
  • 253
  • 2
  • 8

1 Answers1

0

It looks like there are a couple of bugs in your example. You have to define the variables within a data.frame for the data=df argument to work. I've made some adjustments below:

df <- data.frame(
  a = runif(10, 5, 90),
  b = runif(10, 4, 10),
  c = runif(10, 0, 1),
  d = runif(10, 5, 65),
  e = runif(10, 1, 2),
  f = runif(10, 1, 100),
  g = runif(10, 80, 90),
  h = c(1,12,3,5,NA,16,17,NA,9,10)
)

The call to RcompAngrist gives me:

m_rca <- RcompAngrist::kclass(a ~ b + c + d | d + e + f + g + h,
                              data = df)
#>Error in cbind(x_exo, z, x_endo, y) : 
#>number of rows of matrices must match (see arg 2)

m_rca <- RcompAngrist::kclass(a ~ b + c + d | d + e + f + g + h,
                             data = na.omit(df))
#> Error in R_Z[c(n_G, n_y), c(n_G, n_y)]: subscript out of bounds

In the first case, it looks like the NA are being dropped from h, but not from the other rows, which would give you the number of rows must match error. In the second, I'm not sure what's happening.

There are a couple alternatives. The best, if you don't require LIML, would be to use AER::ivreg, which will give you a two-stage least squares estimate.

You're also welcome to use the rkclass package (disclaimer: I wrote it, and Stigler's repo was a huge help while I was doing it), which agrees with the TSLS results from AER::ivreg, and offers a LIML option. It's in a state of flux though and hasn't been heavily tested, so it would be best to confirm the results.

m_aer <- AER::ivreg(a ~ b + c + d | d + e + f + g + h,
                    data = df)

m_rkc <- rkclass::kclass(a ~ b + c + d | d + e + f + g + h,
                         model.type = "TSLS",
                         data = df)

summary(m_aer)
#> 
#> Call:
#> AER::ivreg(formula = a ~ b + c + d | d + e + f + g + h, data = df)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -30.296 -16.990   5.157  11.238  28.988 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  64.7730    87.3657   0.741    0.500
#> b             1.3600     9.0175   0.151    0.887
#> c            -7.8303    52.9975  -0.148    0.890
#> d            -0.5200     0.5537  -0.939    0.401
#> 
#> Residual standard error: 27.63 on 4 degrees of freedom
#> Multiple R-Squared: 0.2457,  Adjusted R-squared: -0.3201 
#> Wald test:  0.43 on 3 and 4 DF,  p-value: 0.743


summary(m_rkc)
#> Call:
#> rkclass::kclass(formula = a ~ b + c + d | d + e + f + g + h, 
#>     data = df, model.type = "TSLS")
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -30.30 -16.99   5.16  11.24  28.99 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)   64.773     87.366    0.74     0.50
#> b              1.360      9.018    0.15     0.89
#> c             -7.830     52.997   -0.15     0.89
#> d             -0.520      0.554   -0.94     0.40

The results for rkclass::kclass with LIML specified are wildly difference, so it would definitely be worth checking using something like Stata:

m_liml <- rkclass::kclass(a ~ b + c + d | d + e + f + g + h,
                     model.type = "LIML",
                     data = df)
summary(m_liml)
#>Call:
#>rkclass::kclass(formula = a ~ b + c + d | d + e + f + g + h, 
#>                data = df, model.type = "LIML")
#>
#>Residuals:
#>   Min     1Q Median     3Q    Max 
#>-19.45 -13.84  -1.77  10.82  25.89 
#>
#>Coefficients:
#>            Estimate Std. Error t value Pr(>|t|)
#>(Intercept)   18.784     36.142    0.52     0.63
#>b             10.678     11.677    0.91     0.41
#>c            -13.035     52.957   -0.25     0.82
#>d             -1.101      0.911   -1.21     0.29
potterzot
  • 638
  • 1
  • 4
  • 11
  • Yes i did make a data frame of it, don't know why it's not in the first part of the code That's how it was: `df <- data.frame(a,b,c,d,e,f,g,h)`. I thought so too, it omits NAs from "h" but not from the other variables. Any ideas why this is? I will also try your package and report back as soon as i have. – avocado1 Oct 04 '15 at 15:25