
I am trying to calculate the coverage probability for a set of residual bootstrap replicates I generated for the intercept and slope of a regression. Can anyone show me how to calculate the coverage probability of the confidence intervals? Many thanks.

Note that I ran the regression manually via QR decomposition, but you can use lm() if that's easier; I just thought doing it manually would be faster.

set.seed(42)  ## for sake of reproducibility
n <- 100
x <- rnorm(n)
e <- rnorm(n)
y <- as.numeric(50 + 25*x + e)
dd <- data.frame(id=1:n, x=x, y=y)

mo <- lm(y ~ x, data=dd)

# Manual Residual Bootstrap
resi <- residuals(mo)
fit <- fitted(mo)
ressampy <- function() fit + sample(resi, length(resi), replace=TRUE)
# Sample y values:
head(ressampy())
# Qr decomposition of X values
qrX <- qr(cbind(Intercept=1, dd[, "x", drop=FALSE]), LAPACK=TRUE)
# faster than lm()
qr.coef(qrX, dd[, "y"])
# One Bootstrap replication
boot1 <- qr.coef(qrX, ressampy())
# 1000 bootstrap replications
boot <- t(replicate(1000, qr.coef(qrX, ressampy())))
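
For reference, one way to turn the replicate matrix boot into confidence intervals (a minimal sketch of the percentile approach mentioned in the comments below, i.e. taking the 2.5% and 97.5% quantiles of each coefficient's bootstrap distribution) would be:

# percentile bootstrap CIs, one row per coefficient
boot.ci.perc <- t(apply(boot, 2, quantile, probs=c(0.025, 0.975)))
boot.ci.perc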

EDIT: Incorporating jay.sf's answer, I rewrote the code using the lm() method and compared the first and second approaches to calculating coverage probability described in the link jay.sf shared:

library(lmtest);library(sandwich)
ci <- confint(coeftest(mo, vcov.=vcovHC(mo, type="HC3")))
ci

FUNInter <- function() {
  X <- model.matrix(mo)
  ressampy.2 <- fit + sample(resi, length(resi), replace = TRUE)
  bootmod <- lm(ressampy.2 ~ X-1)
  confint(bootmod, "X(Intercept)", level = 0.95)
}

FUNBeta <- function() {
  X <- model.matrix(mo)
  ressampy.2 <- fit + sample(resi, length(resi), replace = TRUE)
  bootmod <- lm(ressampy.2 ~ X-1)
  confint(bootmod, "Xx", level = 0.95)
}

set.seed(42)
R <- 1000
Interres <- replicate(R, FUNInter(), simplify=FALSE)
Betares <- replicate(R, FUNBeta(), simplify=FALSE)
ciinter <- t(sapply(Interres, function(x, y) x[grep(y, rownames(x)), ], "X\\(Intercept\\)"))
cibeta <- t(sapply(Betares, function(x, y) x[grep(y, rownames(x)), ], "Xx"))

# second approach of calculating CP
sum(ciinter[,"2.5 %"] <= 50 & 50 <= ciinter[,"97.5 %"])/R
# [1] 0.842
sum(cibeta[,"2.5 %"] <= 25 & 25 <= cibeta[,"97.5 %"])/R
# [1] 0.945
# first approach of calculating CP
sum(apply(ciinter, 1, function(x) {
  all(data.table::between(x, ci[1,1], ci[1,2]))
}))/R
# [1] 0.076
sum(apply(cibeta, 1, function(x) {
  all(data.table::between(x, ci[2,1], ci[2,2]))
}))/R
# [1] 0.405

1 Answer


According to Morris et al. 2019, Table 6, the coverage probability is defined as the probability that the real theta lies within a bootstrapped confidence interval (CI), i.e. a CI from the model applied to many samples based on the actual data or, in other words, to new experiments:

[Figure: coverage definition from Morris et al. 2019, Table 6]

Hence, we want to compute CIs based on OP's proposed i.i.d. bootstrap R times and calculate the proportion of these CIs that contain theta.

First, we estimate our model mo using the actual data.

mo <- lm(y ~ x)

To avoid unnecessarily extracting the fitted values yhat, the residuals u, and the model matrix X inside the replications, we pull them out beforehand, together with the true parameters theta.

yhat <- mo$fitted.values
u <- as.matrix(mo$residuals)
X <- model.matrix(mo)
theta <- c(50, 25)  ## known from data generating process of simulation

In a bootstrap function FUN we wrap all the steps of one replication. In order to use the very fast .lm.fit, we have to calculate the White standard errors manually (identical to what lmtest::coeftest(fit, vcov.=sandwich::vcovHC(fit, type="HC1")) gives).

FUN <- function() {
  ## resampling residuals
  y.star <- yhat + sample(u, length(u), replace=TRUE)
  ## refit model
  fit <- .lm.fit(X, y.star)
  coef <- fit$coefficients[sort.list(fit$pivot)]
  ## alternatively using QR, but `.lm.fit` is slightly faster
  # qrX <- qr(X, LAPACK=TRUE)
  # coef <- qr.coef(qrX, y.star)
  ## white standard errors
  v.cov <- chol2inv(chol(t(X) %*% X))
  meat <- t(X) %*% diag(diag(u %*% t(u))) %*% X
  ## degrees of freedom adjust (HC1)
  d <- dim(X)
  dfa <- d[1] / (d[1] - d[2])
  white.se <- sqrt(diag(v.cov %*% meat %*% v.cov)*dfa)
  ## 95% CIs
  ci <- coef + qt(1 - .025, d[1] - d[2])*white.se %*% t(c(-1, 1))
  ## coverage
  c(intercept=theta[1] >= ci[1, 1] & theta[1] <= ci[1, 2],
    x=theta[2] >= ci[2, 1] & theta[2] <= ci[2, 2])
}
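
As a sanity check (a minimal sketch; it assumes the sandwich package is installed and reuses mo, X and u from above), the manual HC1 standard errors can be compared to sandwich::vcovHC on the original fit:

library(sandwich)
## HC1 standard errors from sandwich
se.sandwich <- sqrt(diag(vcovHC(mo, type="HC1")))
## manual White/HC1 standard errors, as in FUN
v.cov <- chol2inv(chol(t(X) %*% X))
meat <- t(X) %*% diag(diag(u %*% t(u))) %*% X
dfa <- nrow(X)/(nrow(X) - ncol(X))
se.manual <- sqrt(diag(v.cov %*% meat %*% v.cov)*dfa)
all.equal(unname(se.sandwich), unname(se.manual))  ## should be TRUE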

Now we execute the bootstrap using replicate.

R <- 5e3
set.seed(42)
system.time(res <- t(replicate(R, FUN())))
#   user  system elapsed
#  71.19   28.25  100.28 

head(res, 3)
#      intercept    x
# [1,]      TRUE TRUE
# [2,]     FALSE TRUE
# [3,]      TRUE TRUE

The proportion of rows in which both columns are TRUE gives the joint coverage probability; the column means give the coverage probabilities of the individual coefficients.

(cp.t <- mean(rowSums(res) == ncol(res)))  ## coverage probability total  
(cp.i <- colMeans(res))  ## coverage probability individual coefs
(cp <- c(total=cp.t, cp.i))
#  total intercept         x
# 0.8954    0.9478    0.9444

## values with other R:
#   total intercept         x
# 0.90700   0.95200   0.95200  ## R ==   1k
# 0.89950   0.95000   0.94700  ## R ==   2k
# 0.89540   0.94780   0.94440  ## R ==   5k
# 0.89530   0.94570   0.94680  ## R ==  10k
# 0.89722   0.94694   0.94777  ## R == 100k
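
To see how much simulation noise is left in these estimates, a minimal sketch of the binomial Monte Carlo standard error sqrt(p*(1 - p)/R) that Morris et al.'s Table 6 uses for coverage:

## Monte Carlo standard errors of the estimated coverage probabilities
mcse <- sqrt(cp*(1 - cp)/R)
round(rbind(cp, mcse), 4)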

And this is how it looks after 100k repetitions:

[Figure: cumulative coverage probability estimates over 100k replications]

Code for plot:

r1 <- sapply(seq(nrow(res)), \(i) mean(rowSums(res[1:i,,drop=FALSE]) == ncol(res)))
r2 <- t(sapply(seq(nrow(res)), \(i) colMeans(res[1:i,,drop=FALSE])))
r <- cbind(r1, r2)

matplot(r, type='l', col=2:4, lty=1, main='coverage probability', xlab='R', 
        ylab='cum. mean',ylim=c(.89, .955))
grid()
sapply(seq(cp), \(i) abline(h=cp[i], lty=2, col=i + 1))
legend('right', col=2:4, lty=1, legend=names(cp), bty='n')

Data:

set.seed(42)
n <- 1e3
x <- rnorm(n)
y <- 50 + 25*x + rnorm(n)
  • Thanks for the answer @jay.sf, but how does this reflect that the confidence intervals are derived from the bootstrapped residuals, which is the method I am interested in testing after all? – cliu Dec 27 '20 at 17:53
  • @cliu Ah sorry, I was focused on the coverage probability so much that I've overlooked the residuals instead of the coefficients! Actually you want the CIs of your `boot` object, right? And repeat that process many times to get the coverage probability, i.e. calculate how often the bootstrap CIs lie within the "HC2" CIs? – jay.sf Dec 27 '20 at 18:05
  • Yes, that's right. I tried this ```lointer <- boot[,1]-1.96 *sd(boot[,1])/sqrt(length(boot[,1])); upinter <- boot[,1]+1.96 *sd(boot[,1])/sqrt(length(boot[,1])); fdinter <- data.frame(L=lointer, U=upinter); manu_interCR <- mean(lointer <= 50 & 50 <= upinter)```but the coverage probability is just 0.043 and it doesn't seem right :( – cliu Dec 27 '20 at 18:14
  • @cliu Hm, usually you take the `1±alpha/2` quantiles to get bootstrap CIs. See my edited answer, what do you think? – jay.sf Dec 27 '20 at 18:19
  • Yes, that's what I was looking for. Thanks! By the way, do you know why the coverage probability for x was so low? Also an ice-on-the-cake question: do you know what is the base R version of `matrixStats::colQuantiles`? – cliu Dec 27 '20 at 19:28
  • @cliu Great, nice to hear that! I believe you'll get better results when you have more observations in `dd`. I've edited your data generating code to `n=100`, now the results are better. Also see Edit for base R code versions. (I also use few packages myself, `matrixStats` functions are pretty convincing in terms of speed though.) – jay.sf Dec 27 '20 at 19:54
  • Just a follow-up comment: when calculating coverage probability, I was following Morris et al. (2019) http://onlinelibrary.wiley.com/doi/full/10.1002/sim.8086 (see Coverage in Table 6) which is close to the second approach mentioned in your link. Do you know any empirical papers that backed the first approach? I think it is also easier to implement the second approach since you don't need to derive the confidence interval from the original sample in the first place – cliu Dec 28 '20 at 02:46
  • @cliu Interesting. Yes I think [Schall 2012](https://pubmed.ncbi.nlm.nih.gov/22623325/) section 2.2 basically covers the approach we did here, where we got the CIs of step 2b of the algorithm by residual bootstrapping, doesn't it? – jay.sf Dec 28 '20 at 06:12
  • Thank you for the paper. To me it is still in line with the Morris et al. approach (note the sentence "The coverage τn, therefore, is calculated as the proportion c/S of simulated CIs C1−α (x) that cover the parameter θ = θ (F ) of the distribution"). I rewrote the code that implemented the traditional `lm()` method and compared the first and second approach of calculating coverage probability and they diverged. The result from the second approach seems more reasonable though (see my Edit of the question). – cliu Dec 28 '20 at 15:35
  • @cliu I see. I'm playing around with the new approach. First thing I've noted you should apply the same method for CI in the bootstraps. However, the results then are even worse so far. – jay.sf Dec 29 '20 at 08:10
  • @cliu What is actually your source of this particular kind of residual bootstrap? – jay.sf Dec 29 '20 at 14:32
  • It has various names (e.g., iid bootstrap, fixed-x resampling) and probably that's why it causes some confusion. See https://socialsciences.mcmaster.ca/jfox/Books/Companion-2E/appendix/Appendix-Bootstrapping.pdf for an introduction of the method – cliu Dec 29 '20 at 15:03
  • @cliu Thanks for sharing the sources! Our suspicion that something was fishy was correct. Actually we want to know the probability that our actual coefficients (and not the CIs) would be covered by the CIs of future experiments. See my revised answer. – jay.sf Dec 29 '20 at 18:38
  • Cool! I got stuck initially because I wasn't able to extract CIs but only coefficients from the replicated data based on the QR decomposition (and for that reason I thought that was a dead-end), although later I found a way around by using the `lm()` approach. Your answer is another way (possibly more robust and faster) of finding CIs by attaching some white standard errors. One thing clear though is that all the results help clarify the definition of coverage probability which is the rate of CIs that capture the true coefficients. – cliu Dec 29 '20 at 21:13
  • Just came to revisit your code @jay.sf. One minor thing: I think the true parameters are 50 and 25, not the ones retrieved from the model: `ci[1,1]` and `ci[1,2]` as these are empirical estimators derived from one realization/sample of the population – cliu Jan 07 '21 at 02:59
  • @cliu You're referring to [Chapter 3: ..."θ is used to represent a conceptual estimand and its true value"](https://onlinelibrary.wiley.com/doi/full/10.1002/sim.8086#sim8086-sec-0003-title), don't you? So we'd rather use 50 and 25 instead of `coef0` in the last lines of `FUN`, right? – jay.sf Jan 07 '21 at 06:20
  • Yes, that's right, @jay.sf. In non-simulation studies we never know the values of the true parameters, but in simulation studies that's something we do have some control over and set up in the beginning. It's not a major problem, but I just want to make your code more conceptually accurate – cliu Jan 07 '21 at 17:40
  • @cliu Great, then we have it! I also brought some color into the game with a plot. See update. – jay.sf Jan 07 '21 at 18:55
  • @cliu Please note my edit of `FUN` in line 6 which sorts coefficients according to `sort.list(fit$pivot)`. – jay.sf Jan 12 '21 at 11:18
  • Hi @jay.sf, I hope you are doing well! These days I am revisiting this CP problem and your answer. I just have a quick question: can you share the code for the plot shown at the bottom of your answer? Thank you, as always! – cliu Dec 13 '21 at 17:00
  • Hi @cliu, see update please. `R version 4.1.2` was used. – jay.sf Dec 14 '21 at 18:12