Applying 'clustering functions' to a series of linear models

Question

I want to iterate over a list of linear models and apply "clustered" standard errors to each model using the vcovCL function. My goal is to do this as efficiently as possible (I am running a linear model across many columns of a dataframe). My problem is trying to specify additional arguments inside of the anonymous function. Below I simulate some fake data. Precincts represent my cross-sectional dimension; months represent my time dimension (5 units observed across 4 months). The variable int is a dummy for when an intervention takes place.

df <- data.frame(
  precinct = c( rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4), rep(5, 4) ),
  month = rep(1:4, 5),
  crime = rnorm(20, 10, 5),
  int = c(c(0, 1, 1, 0), rep(0, 4), rep(0, 4), c(1, 1, 1, 0), rep(0, 4))
  )

df[1:10, ]

outcome <- df[3]
est <- lapply(outcome, FUN = function(x) { lm(x ~ as.factor(precinct) + as.factor(month) + int, data = df) })

se <- lapply(est, function(x) { sqrt(diag(vcovCL(x, cluster = ~ precinct + month))) })

I receive the following error message when adding the cluster argument inside of the vcovCL function.

Error in eval(expr, envir, enclos) : object 'x' not found

The only way around it, in my estimation, would be to index the dataframe, i.e., df$, and then specify the 'clustering' variables. Could this be achieved by specifying an additional argument for df inside of the function call? Is this code efficient?

Maybe specifying the model equation formulaically is a better way to go, I suppose.

Any thoughts/comments are always helpful :)

My apologies for not providing a better example. I would be using multiple columns in a dataframe. I only used one for illustrative purposes. — Thomas Bilach, May 03 '19 at 23:07
Yes, it was super helpful. Do you mean the 'triangle' or the 'check mark'? Still getting used to everything. — Thomas Bilach, May 05 '19 at 16:57
The "up" triangle is for upvotes. You would upvote posts that you found particularly helpful. People whose posts have been upvoted then receive 10 reputation points. You can also loose them if people downvote your post with the downward-facing triangle. The check-mark is for accepting answers. — coffeinjunky, May 05 '19 at 17:16

coffeinjunky · Accepted Answer · 2019-05-04T11:30:32.573

Here is one approach that would retrieve clustered standard errors for multiple models:

library(sandwich)

# I am going to use the same model three times to get the "sequence" of linear models. 
mod <- lm(crime ~ as.factor(precinct) + as.factor(month) + int, data = df)

# define function to retrieve standard errors:
robust_se <- function(mod) {sqrt(diag(vcovCL(mod, cluster = list(df$precinct, df$month))))}

# apply function to all models:
se <- lapply(list(mod, mod, mod), robust_se)

If you want to get the entire output adjusted, the following might be helpful:

library(lmtest)
adj_stats <- function(mod) {coeftest(mod, vcovCL(mod, cluster = list(df$precinct, df$month)))}

adjusted_models <- lapply(list(mod, mod, mod), adj_stats)

To address the multiple column issue:

In case you are struggling with running linear models over several columns, the following might be helpful. All the above would stay the same, except that you are passing your list of models to lapply.

First, let's use this dataframe here:

df <- data.frame(
  precinct = c( rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4), rep(5, 4) ),
  month = rep(1:4, 5),
  crime = rnorm(20, 10, 5),
  crime2 = rnorm(20, 10, 5),
  crime3 = rnorm(20, 10, 5),
  int = c(c(0, 1, 1, 0), rep(0, 4), rep(0, 4), c(1, 1, 1, 0), rep(0, 4))
)

Let's define the outcome columns:

outcome_columns <- c("crime", "crime2", "crime3")

Now, let's run a regression with each outcome:

models <- lapply(outcome_columns, 
         function(outcome) lm( eval(parse(text = paste0(outcome, " ~ as.factor(precinct) + as.factor(month) + int"))), data = df) )

And then you would just call

adjusted_models <- lapply(models, adj_stats)

Regarding efficiency:

The above code is efficient in that it is easily adjustable and quick to write up. For most use cases, it will be perfectly fine. For computational efficiency, note that your design matrix is the same in all cases, i.e. by precomputing the common elements (e.g. inv(X'X)*X'), you could save some computations. You would however lose out on the convenience of many built-in functions.

Applying 'clustering functions' to a series of linear models

1 Answers1