I have a large dataset, on the order of 2^15 entries, and I compute the confidence interval of the mean of the entries with scipy.stats.bootstrap. For a dataset of this size, this takes about 6 seconds on my laptop. I have many datasets, so this is too slow for me (especially when I just want to do a quick test run to debug the plotting, etc.). By default, SciPy's bootstrapping function resamples the data n_resamples=9999 times. As I understand it, resampling and computing the average of each resampled dataset should be the most time-consuming part of the process. However, when I reduce the number of resamples by roughly three orders of magnitude (n_resamples=10), the run time of the bootstrapping call does not even halve.
How can I do faster bootstrapping?
I'm using Python 3 and SciPy 1.9.3.
import numpy as np
from scipy import stats
from time import time

# bootstrap expects a sequence of samples, so wrap the 1-D data in an outer array
data = np.random.rand(2**15)
data = np.array([data])

# default number of resamples, processed one at a time (batch=1)
start = time()
bs = stats.bootstrap(data, np.mean, batch=1, n_resamples=9999)
end = time()
print(end - start)

# ~1000x fewer resamples, still batch=1
start = time()
bs = stats.bootstrap(data, np.mean, batch=1, n_resamples=10)
end = time()
print(end - start)

# ~1000x fewer resamples, with the default batch setting
start = time()
bs = stats.bootstrap(data, np.mean, n_resamples=10)
end = time()
print(end - start)
gives
6.021066904067993
3.9989020824432373
30.46708607673645
To speed up the bootstrapping, I set batch=1. As I understand it, this is more memory efficient and keeps the data from being swapped out. Leaving batch at its default increases the run time considerably, as the third timing above shows.
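For comparison, this is roughly the resample-and-average work I have in mind when I say it should dominate the cost. It is only a rough sketch, not equivalent to what scipy.stats.bootstrap computes (SciPy's default is the BCa interval, whereas this just takes percentiles), and the chunk size and the 95% level here are arbitrary choices of mine:

import numpy as np

rng = np.random.default_rng()
sample = np.random.rand(2**15)

n_resamples = 9999
chunk = 100  # resamples drawn and averaged per vectorized step, to bound memory use

means = np.empty(n_resamples)
for lo in range(0, n_resamples, chunk):
    hi = min(lo + chunk, n_resamples)
    # draw indices with replacement and average each resampled row
    idx = rng.integers(0, sample.size, size=(hi - lo, sample.size))
    means[lo:hi] = sample[idx].mean(axis=1)

# plain 95% percentile interval (not the BCa interval SciPy returns by default)
ci_low, ci_high = np.percentile(means, [2.5, 97.5])

This is the kind of loop I assumed scipy.stats.bootstrap spends its time in, which is why the weak dependence of the run time on n_resamples surprises me.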