
I just learned how to do bootstrapping in R, and I'm excited. I was playing with some data and found that, no matter how many bootstrap replicates I take, the CIs come out about the same. I thought that the more replicates I took, the narrower the CI should be. Here's the code.

library(boot)

M. <- function(dados, i) {
  d <- dados[i, ]     # rows selected by the bootstrap resample
  mean(d$queimadas)   # statistic: mean of the resampled column
}

bootmu <- boot(dados, statistic = M., R = 10000)

boot.ci(bootmu)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = bootmu)

Intervals : 
Level      Normal              Basic         
95%   (18.36, 21.64 )   (18.37, 21.63 )  

Level     Percentile            BCa          
95%   (18.37, 21.63 )   (18.37, 21.63 )  
Calculations and Intervals on Original Scale
Warning message:
In boot.ci(bootmu) : bootstrap variances needed for studentized intervals

As one can see, I took 10000 replicates. Now let's try with just 100.


bootmu <- boot(dados, statistic = M., R = 100)

boot.ci(bootmu)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 100 bootstrap replicates

CALL : 
boot.ci(boot.out = bootmu)

Intervals : 
Level      Normal              Basic         
95%   (18.33, 21.45 )   (18.19, 21.61 )  

Level     Percentile            BCa          
95%   (18.39, 21.81 )   (18.10, 21.10 )  
Calculations and Intervals on Original Scale
Some basic intervals may be unstable
Some percentile intervals may be unstable
Warning : BCa Intervals used Extreme Quantiles
Some BCa intervals may be unstable
Warning messages:
1: In boot.ci(bootmu) :
  bootstrap variances needed for studentized intervals
2: In norm.inter(t, adj.alpha) :
  extreme order statistics used as endpoints

The number of replicates is 100 times lower, but the CIs are essentially the same. Why?

If anyone wants to replicate the exact same example, here's the data.

> dados
   queimadas plantacoes
1         27        418
2         13        353
3         21        239
4         14        251
5         18        482
6         18        361
7         22        213
8         24        374
9         21        298
10        15        182
11        23        413
12        17        218
13        10        299
14        23        306
15        22        267
16        18         56
17        24        538
18        19        424
19        15         64
20        16        225
21        25        266
22        21        218
23        24        424
24        26         38
25        19        309
26        20        451
27        16        351
28        15        174
29        24        302
30        30        492
Leonardo

1 Answer


The confidence interval for your estimator does not depend on the number of bootstrap replicates; it depends on the size of the original dataset.

Increasing the number of bootstrap replicates increases the precision with which the sampling distribution (and hence the confidence intervals) is estimated, but it cannot make your estimate of the mean of your sample more precise.
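One way to see this directly is to compare the spread of the bootstrap distribution itself for very different replicate counts. A minimal sketch, assuming `dados` and the statistic function `M.` from the question are already defined:

```r
library(boot)

set.seed(1)
b_small <- boot(dados, statistic = M., R = 100)
b_large <- boot(dados, statistic = M., R = 10000)

# The standard deviation of the bootstrap replicates (stored in $t)
# estimates the standard error of the mean. That quantity is a property
# of the original sample, so both values should be close to
# sd(dados$queimadas) / sqrt(nrow(dados)).
sd(b_small$t)
sd(b_large$t)
sd(dados$queimadas) / sqrt(nrow(dados))
```

More replicates only make `sd(bootmu$t)` a less noisy estimate of the same underlying standard error; they do not shrink it.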

Try calculating the confidence interval around the mean using an analytic method for comparison.

> confint(lm(dados$queimadas~1))
               2.5 %   97.5 %
(Intercept) 18.27624 21.72376

You will see that both bootstraps (with 100 or 10000 replicates) approximate the CI calculated by linear regression fairly well.

George Savva
  • By "size of original dataset" I suppose you mean the size of each bootstrap replicate (n), right? Besides that, aren't the CIs calculated from the mean of the bootstrap replicates? Then the replicates basically become a sample to calculate CIs from. And with a higher sample size (in this case, number of replicates), shouldn't the CIs get narrower? If you could point me to a technical explanation for that, it'd be nice, because I don't see why this is happening. – Leonardo Apr 04 '22 at 01:16
  • No, the CIs are calculated based on the variance of the bootstrap replicates; it is *not* the CI for the mean of the bootstrap samples. More bootstrap replicates don't change the standard deviation of this sampling distribution. I don't have a good reference. – George Savva Apr 04 '22 at 10:38
  • Bootstrapping isn't a way to get more data or more precision around your estimate. Bootstrapping is a way to estimate what the true standard error of your original estimator is. The more replicates you have, the better you will estimate the correct confidence interval. We use bootstrapping when correctly calculating confidence intervals any other way is difficult. – George Savva Apr 04 '22 at 10:40
  • Look at the result from `confint(lm(dados$queimadas~1))`. I have added this to the answer. – George Savva Apr 04 '22 at 12:38
  • So you're saying that, if I do N bootstraps with 10000 replicates and N bootstraps with 10 replicates, the means of the CI endpoints (say, the means of the lower bounds) will be the same, but the variances will be different? (More precision in the CI estimation with more replicates, thus smaller variance across the many CI estimates.) – Leonardo Apr 04 '22 at 14:18
  • Yes. You could try this to check it for yourself. – George Savva Apr 04 '22 at 14:30
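The check suggested in the last comment can be sketched as follows: repeat the whole bootstrap many times for two replicate counts and compare the mean and spread of the resulting CI lower bounds. This is an illustrative sketch (the helper `lower_ci` is made up for this example), again assuming `dados` and `M.` from the question:

```r
library(boot)

# Hypothetical helper: run one bootstrap with R replicates and return
# the lower bound of the 95% percentile CI (4th element of $percent).
lower_ci <- function(R) {
  b <- boot(dados, statistic = M., R = R)
  boot.ci(b, type = "perc")$percent[4]
}

set.seed(1)
lo_few  <- replicate(200, lower_ci(10))     # will warn about extreme quantiles
lo_many <- replicate(200, lower_ci(10000))

mean(lo_few); mean(lo_many)   # similar on average
sd(lo_few);   sd(lo_many)     # far more variable with only 10 replicates
```

The averages of the lower bounds should be close, while the run-to-run variability should be much larger with 10 replicates than with 10000, which is exactly the distinction drawn in the answer.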