Memory efficient cluster bootstrap

Question

I have a very large dataset (10m observations, reduced to 14 essential variables) and am following this thread: Cluster bootstrapped standard errors in R for plm functions

My code after loading the libraries is:

fe_pois <- fepois(totalcountdeals ~ logdist + inst_dist_std + whited_wu_std:findev_std | iso_o_code + sic2, vcov=~pair, data = cbma, nthreads = 2)

boot_feols <- boottest(fe_pois, clustid = "pair", param = "logdist", B = 199, nthreads = 2)

However, this fails with a memory error, even on a machine with a lot of memory. Are there any other solutions? I need to bootstrap the standard errors because one of my regressors is itself an estimate.

I have also tried filtering the data and running the above on a sub-sample, just as a test. There I get a new error:

Error in if (!is.numeric(lower) || !is.numeric(upper) || lower >= upper) stop("lower < upper  is not fulfilled") : 
  missing value where TRUE/FALSE needed

A.Fischer · Accepted Answer · 2022-02-11T14:37:39.293

Thanks for your question!

fwildclusterboot::boottest() only supports estimation of OLS models, so running a Poisson regression should in fact throw an error in boottest(). I will have to add an error condition for that :)

The error that you observe

Error in if (!is.numeric(lower) || !is.numeric(upper) || lower >= upper) stop("lower < upper  is not fulfilled") : 
  missing value where TRUE/FALSE needed

stems from the numerical root finding procedure employed to compute confidence intervals - and I believe it is a direct result of fwildclusterboot not supporting Poisson regression.
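The wording of the message matches the input check at the top of base R's stats::uniroot(). As a minimal sketch (base R only, independent of fwildclusterboot; the function and the bounds are made up for illustration), a missing lower bound reproduces the same failure:

```r
# A root-finding call whose lower bound is NA, e.g. because an upstream
# computation on an unsupported model produced a missing value.
msg <- tryCatch(
  uniroot(function(x) x, lower = NA_real_, upper = 1),
  error = function(e) conditionMessage(e)
)

# The check `lower >= upper` evaluates to NA, so if() cannot decide.
print(msg)
```

So the error is a symptom: something upstream handed the root finder an NA bound, and the bound itself is what needs investigating.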

The memory problems in both boottest and fwildclusterboot arise for one of two reasons:

1. The model you are fitting is very big, and fwildclusterboot only accepts one fixed effect. All other factor variables specified in fixest are translated to dummies, so the design matrix passed to boottest() might be very large. In fact, if you do not use the fe argument of fwildclusterboot::boottest(), all fixed effects specified in feols() are translated to dummies and no fixed effect is projected out within the bootstrap. You can check whether this is the root of your error by running your regression via lm() or glm() (or via a similar command in Stata) and seeing whether these estimations fail due to memory as well.
2. Both boottest and fwildclusterboot are fully vectorized, so both compute a weights matrix v of dimension G x B, where G is the number of clusters and B the number of bootstrap iterations. If both G and B are large, this consumes quite a bit of memory. Stata's boottest has an option, matsize, that aims to help in such a situation. I quote from the documentation: "matsize(#) limits the memory demand of the G × B matrix v∗ to prevent caching of virtual memory to disk. The limit is specified in gigabytes; e.g., matsize(8) would limit the memory demand to 8GB. Note that this option does not limit the actual size of v∗. Instead, it forces boottest to break the matrix into chunks no larger than the limit, and then create and destroy each chunk in turn."
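To get a feel for these two memory drivers, a back-of-the-envelope calculation helps. The sketch below uses only base R and assumes dense double-precision storage (8 bytes per entry). The helper name dense_matrix_gb and the cluster and column counts are hypothetical placeholders, not values taken from the question's data; only B = 199 and the 10m observations come from the question.

```r
# Size in GB of a dense numeric matrix with rows x cols double entries.
dense_matrix_gb <- function(rows, cols) rows * cols * 8 / 1024^3

# Bootstrap weights matrix v: G clusters x B bootstrap iterations.
# G = 1e6 is a made-up cluster count; B = 199 is the value from the question.
weights_gb <- dense_matrix_gb(rows = 1e6, cols = 199)

# Design matrix once fixed effects are expanded to dummies:
# N observations x (regressors + dummy columns). 300 columns is a made-up count.
design_gb <- dense_matrix_gb(rows = 1e7, cols = 300)

print(round(c(weights = weights_gb, design = design_gb), 1))
```

With the real data, the cluster count would be something like length(unique(cbma$pair)), and the column count would be the number of regressors plus the number of levels of each fixed effect that ends up as dummies.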

So I would suggest that you try out the matsize option in Stata's boottest to see whether your error is caused by a large weights matrix.

Memory is a known issue for fwildclusterboot, and improving memory performance is work in progress.

Last, there is also a new Julia implementation of the fast wild cluster bootstrap algorithm in WildBootTests.jl that supports ML based models and is - in my experience - less memory demanding than fwildclusterboot.

Update 1: See also this discussion on using boottest and ppmlhdfe.

Update 2: If you want to run the wild cluster bootstrap because you are afraid that your number of clusters is low and your standard errors might be biased, an alternative might be to try a degrees-of-freedom correction as implemented for glm() in the clubSandwich package. I have to admit, though, that I am not sure how well the implemented corrections work for Poisson regression.
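A minimal sketch of what the clubSandwich route could look like, on simulated stand-in data rather than the question's dataset. All variable names and sizes here are hypothetical, and the choice of "CR2" with Satterthwaite degrees of freedom is an assumption about which correction to try, not a recommendation specific to Poisson models:

```r
# Simulated stand-in data: 20 clusters of 30 observations each.
set.seed(1)
dat <- data.frame(
  pair = rep(seq_len(20), each = 30),
  x    = rnorm(600),
  z    = rnorm(600)
)
dat$y <- rpois(600, lambda = exp(0.2 + 0.3 * dat$x - 0.1 * dat$z))

# Plain Poisson GLM; fixed effects would have to enter as factor() terms here.
fit <- glm(y ~ x + z, family = poisson, data = dat)

# CR2 cluster-robust tests with Satterthwaite degrees of freedom.
# Guarded so the script still runs when clubSandwich is not installed.
ct <- NULL
if (requireNamespace("clubSandwich", quietly = TRUE)) {
  ct <- clubSandwich::coef_test(fit, vcov = "CR2", cluster = dat$pair,
                                test = "Satterthwaite")
  print(ct)
}
```

Because glm() has no way to project out fixed effects, this route runs into the same dummy-expansion memory cost as above when the fixed effects have many levels.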
