0

I have a monthly time series of count data that I have fitted a negative binomial AR1 segmented regression model with dummy variables for months to control for seasonal effects and an indicator variable for an intervention. I am attempting to quantify the magnitude of the effect of the intervention by calculating the difference of the predicted values during the last 12 months of the series, with and without the intervention effect. I want to report the confidence intervals of the difference. I have a rough understanding of how to do this and understand the need for using the multivariate delta method, but I have a hard time implementing this in R. Dummy data for clarity:

## An explanation of the structure of the df
df <- data.frame(
    count = c(10,12,9,9,12,15,17,13,10,6), 
    month = facotr(month.name, levels = month.name), 
    time = 1:length(timeseries), 
    intervention = ifelse(time >= time_of_intervention, 1, 0), 
    intervention_trend = c(rep(0, time_of_intervention), 1:(length(timeseries) - (time_of_intervention - 1))),
    lag1 = c(NA, count[-nrow(df)])
)

## models with and without the intervention
nullmodel <- glm.nb(count ~ month + time + log(lag1), data=df)
fullmodel <- glm.nb(count ~ intervention_level + intervention_trend + month + time + log(lag1), data = df)

## the difference I would like confidence intervals for
sum_without <- sum(exp(predict(nullmodel)[length(predict(nullmodel))- 11:0]))
sum_with <- sum(exp(predict(fullmodel)[length(predict(fullmodel))- 11:0]))
diff <- sum_with - sum_without

I thought I would use multivariate normal sampling of the coefficients of the model with and without and do repeated sums of the fitted values of last 12 timepoints to produce a distribution and estimate the confidence intervals from there. I don't know how to do this in R. This is as far as I have gotten:

without_coeff_sample <- mvrnorm(n = 10000, mu = nullmodel$coefficients, Sigma = vcov(nullmodel))
with_coeff_sample <- mvrnorm(n = 10000, mu = fullmodel$coefficients, Sigma = vcov(fullmodel))

I'm not sure how to proceed from here. Any thoughts appriciated.

user6571411
  • 2,749
  • 4
  • 16
  • 29
  • 1
    I might be wrong about this, but if you want to "quantify the magnitude of the effect of the intervention", wouldn't you be comparing predicted values between datasets where `intervention` was set to 0 and 1 respectively (and `intervention_trend` set accordingly), using simulated coefficients from `fullmodel` only? – Weihuang Wong Aug 13 '16 at 16:42
  • @WeihuangWong both intervention and intervention_trend are dummy variables indicating when the intervention took place in a time series (0 before the intervention and 1 after) and the relative change in trend compared to baseline trend (0 before the intervention and 1:length(timeseries) after). Sorry I did not explain sufficiently. – user6571411 Aug 13 '16 at 17:58
  • 1
    Right, I'm just puzzled (and I could be wrong) by why you're comparing two _models_ to get your treatment effects. How I have usually done it is to fit a model, simulate coefficients (as you're doing using `mvrnorm`), and get treatment effects by counterfactually setting values for treatment (`intervention`, in your case) in my data, then getting predictions from there. See, e.g. [the approach used in this question](http://stackoverflow.com/q/11167154/6455166). – Weihuang Wong Aug 13 '16 at 18:13
  • @WeihuangWong I think you are correct. I have attempted to replicate this based on the answer you linked to but am running into the same issue, that is to say, I don't know how to use the data generated with mvrnorm to produce fitted values. If you would care to give an example of the steps required it would very likely answer my question. – user6571411 Aug 13 '16 at 19:11

1 Answers1

0

I use the built-in quine dataset as an example. Suppose we are interested in the "effect" of being Non-Aboriginal (Eth==N) on days of school missed. Step 2 is where we create the counterfactual data, setting Eth to 1 (for Aboriginal) or 2 (for Non-Aboriginal). In step 3, we source a function written by SO user MrFlick that allows us to make a fake glm object (with coefficients we defined ourselves) to pass into predict(). In step 4, we compute predicted values. We create a fake glm object, using the simulated coefficients, and pass it into predict together with our (counterfactual) data.

library(MASS)

# (1) Fit model and simulate coefficients
quine.nb1 <- glm.nb(Days ~ Sex + Age + Eth + Lrn, data = quine)
coefs <- mvrnorm(1000, coef(quine.nb1), vcov(quine.nb1))

# (2) Create counterfactual data
EthA <- EthN <- quine
EthA$Eth <- 1
EthN$Eth <- 2

# (3) Source function `makeglm` to create "fake" glm object
source("https://gist.githubusercontent.com/MrFlick/ae299d8f3760f02de6bf/raw/6c54045bf29d64b31239fb9c56d5f32f085cdc40/makeglm.R")

# (4) Create predicted values using simulated coefficients and counterfactual data
preds <- lapply(list(EthA, EthN), function(dd) {
  apply(coefs, MAR = 1, function(x) {
    newmodel <- do.call(
                  makeglm, 
                  c(list(formula = as.formula(Days ~ Sex + Age + Eth + Lrn)), 
                    x,
                    list(family = quasi)))
    exp(predict(newmodel, dd))
  })
})

# (5) Compute average treatment effect
ATE <- rowMeans(preds[[2]] - preds[[1]])
quantile(ATE, c(0.05, 0.5, .95))
#        5%       50%       95% 
# -6.798913 -5.188922 -2.791218 
mean(ATE)
# [1] -5.182645

All else equal, Non-Aboriginal students miss 5 fewer days of school than Aboriginal students.

One of the issues you should consider is what your estimand is: in the example above, it is the average treatment effect. However, you might want to estimate the average effect on the treated, i.e. what if units that received intervention did not receive treatment? Your estimand will affect how you create your counterfactual data. In addition, since the linear time trend for intervention depends on intervention, you also need to change that variable when you create your counterfactual data. One final issue that complicates your analysis is that you have a lagged variable. It seems that every predicted value would depend on the previously predicted value. I've never done something like this before; intuitively, you may need to generate predicted values iteratively (?).

Community
  • 1
  • 1
Weihuang Wong
  • 12,868
  • 2
  • 27
  • 48
  • Very interesting. I'm going to work through this and see if it fits. I have the same questions as you regarding the lag variable. Time to hit the books – user6571411 Aug 13 '16 at 21:44