I would like to use the boot() and boot.ci() functions from library("boot") on a large data set (~20,000 rows) with type="bca".

If R (the number of bootstrap replicates) is too small (I have tried 1k to 10k), I get the following error:

Error in bca.ci(boot.out, conf, index[1L], L = L, t = t.o, t0 = t0.o, :
estimated adjustment 'a' is NA

However, if I run 15k to 20k+ bootstraps, I get:

Cannot allocate vector size # GB

(usually ranging from 1.7 to 6.4 GB, depending on the data set and the number of bootstraps).

I read that I need more RAM, but I have a Windows desktop with 16 GB of RAM and I'm using 64-bit R, which suggests my computer should be able to handle this.

How can I use bootstrapping methods on larger data sets if too few bootstraps cannot produce estimates and sufficient bootstraps result in insufficient memory?

My code:

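multRegress <- function(mydata) {
  numVar <- NCOL(mydata)
  Variables <- names(mydata)[2:numVar]

  # correlation matrix: column 1 is the outcome, the rest are predictors
  corMat <- cor(mydata, use = "pairwise.complete.obs")
  RXX <- corMat[2:numVar, 2:numVar]
  RXY <- corMat[2:numVar, 1]

  # symmetric square root of the predictor correlation matrix
  RXX.eigen <- eigen(RXX)
  D <- diag(RXX.eigen$values)
  delta <- sqrt(D)
  lambda <- RXX.eigen$vectors %*% delta %*% t(RXX.eigen$vectors)
  lambdasq <- lambda^2

  beta <- solve(lambda) %*% RXY
  rsquare <- sum(beta^2)

  # raw relative weights, and the same weights as percentages of R-squared
  RawWgt <- lambdasq %*% beta^2
  import <- (RawWgt / rsquare) * 100

  # return the result directly rather than assigning globals with <<-
  data.frame(Variables, Raw.RelWeight = RawWgt,
             Rescaled.RelWeight = import)
}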

# function passed to boot
multBootstrap <- function(mydata, indices) {
  mydata <- mydata[indices, ]
  multWeights <- multRegress(mydata)
  return(multWeights$Raw.RelWeight)
}

# call boot
library(boot)
multBoot <- boot(thedata, multBootstrap, R = 15000)
multci <- boot.ci(multBoot, conf = 0.95, type = "bca")
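For a sense of scale, here is a rough size estimate. It rests on an assumption that the question itself does not state: that somewhere in the BCa calculation a matrix with one row per bootstrap replicate and one column per observation (the shape that boot.array() returns) is held in memory as doubles. The toy data and the mean statistic are made up purely to show that shape.

```r
library(boot)

# Toy illustration: boot.array() gives one row per replicate and one
# column per observation (resampling frequencies)
set.seed(1)
toy <- rnorm(50)
b <- boot(toy, function(d, i) mean(d[i]), R = 20)
dim(boot.array(b))   # 20 rows, 50 columns

# Approximate size of such a matrix at the scale described in the question,
# assuming 8 bytes per element
n <- 20000   # observations (approximate figure from the question)
R <- 15000   # bootstrap replicates
gib <- R * n * 8 / 2^30
round(gib, 2)   # about 2.24
```

A single matrix of that shape would already fall inside the 1.7 to 6.4 GB range of the reported allocation errors, and any intermediate copies would add to it.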
Asked by Ray Fang; edited by user20650
  • how many statistics are you bootstrapping? – user20650 May 21 '20 at 22:40
  • ...from https://stats.stackexchange.com/questions/37918/why-is-the-error-estimated-adjustment-a-is-na-generated-from-r-boot-package#comment333957_37931 seems you need the number of reps to be at least the number of rows of your data – user20650 May 21 '20 at 22:47
  • When I set the number of reps to be greater than or equal to the number of rows, I run out of memory, even at 1.7 GB (even though my machine has 16 GB of memory) – Ray Fang May 21 '20 at 23:21
  • That is why I was asking how many parameters you are bootstrapping? – user20650 May 21 '20 at 23:27
  • ... If you run a simple example of `bs <- boot(...)` and inspect the object you can see that it stores several objects. I'd expect the largest to be the data (`bs$data`) and the simulated statistics (`bs$t`), which will have dimension `R` (nsim) by n_parameters. If you have a great many parameters this could be large. Otherwise it could be how you are generating the bootstrap statistics -- your function -- that is causing the memory issues. Can you share the function that you are using in the `boot` call -- even better if you can supply a **small** reproducible example, data included. – user20650 May 21 '20 at 23:35
  • I have 16 bootstrap statistics as I have 16 predictors in a relative weights analysis: – Ray Fang May 22 '20 at 13:09
  • Thanks. Okay, so from that description it seems unlikely that it is the data returned by the boot function that is causing the memory issues. Are you parallelising the boot function? Perhaps you don't have enough memory for it to be run on each core. Have you tracked the memory from one run of your model? Do you have any other large objects in your workspace that could be consuming memory? One way this could happen is if you have loaded a previous workspace, so try restarting R and run `ls()` straight away to see if there are any objects. – user20650 May 22 '20 at 13:17
  • Thank you! How do I parallelise the boot function? I've done the restarting thing so it isn't my workspace. – Ray Fang May 22 '20 at 13:21
  • it is documented in the `?boot` help page -- but this will not help your memory issue and will likely make it worse. So it could be the function that you are using -- try profiling the memory of it outside of boot https://stackoverflow.com/questions/7856306/monitor-memory-usage-in-r. I can't really offer anything else without seeing your model/data. – user20650 May 22 '20 at 13:26
  • I have posted it. Sorry this is my first time asking a question here. I was having trouble showing you the code – Ray Fang May 22 '20 at 13:28
  • Thanks. You seem to have posted helper functions for your results but not the function that you pass to `boot` i.e. `statistic=multBootstrap` – user20650 May 22 '20 at 13:33
  • Does this help? lol sorry, thank you so much for your patience. multBootstrap<-function(mydata, indices){ mydata<-mydata[indices,] multWeights<-multRegress(mydata) return(multWeights$Raw.RelWeight) } – Ray Fang May 22 '20 at 13:45
  • Thanks. I have updated your question (and removed functions that were not used in the code example). What package is `multRegress` from please? – user20650 May 22 '20 at 13:49
  • Thank you so much! I have included the code for multRegress. – Ray Fang May 22 '20 at 14:09
  • ... another link https://stackoverflow.com/questions/5184953/memory-profiling-in-r-tools-for-summarizing – user20650 May 22 '20 at 14:44
  • Thank you! I will look at that link. But shouldn't my desktop be able to handle "vector sizes of 1.7gb"? I am running 64 bit R and I have 16 GB ram. – Ray Fang May 22 '20 at 15:42
  • Sorry, it seems as if I misunderstood; can I check please: is it `boot.ci` that gives the memory issue rather than the `boot` function? (From some quick tests it seems as if bca is quite RAM hungry.) – user20650 May 22 '20 at 17:30
  • It seems as if the `empinf` function (which is called from `boot:::bca.ci`) is slow. It is coded in base R and has several split-apply-combine routines. I think this could be made more memory efficient / faster by recoding it in data.table syntax, but that seems a bit of a pain. There may be a way to get an estimate of the quantity that empinf returns (I don't know), but the folks at https://stats.stackexchange.com/ may have a suggestion. – user20650 May 23 '20 at 17:25
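One possibility the last comment hints at: boot.ci() accepts precomputed empirical influence values through its `L` argument, and empinf() can produce them by the ordinary jackknife (`type = "jack"`), which needs n leave-one-out evaluations of the statistic rather than a regression on the replicates. The sketch below is not tested on the data in the question; the toy data frame and the `corStat` statistic are hypothetical stand-ins, and whether jackknife influence values are accurate enough for a relative weights analysis is a statistical question this example does not settle.

```r
library(boot)

set.seed(1)
n <- 2000                                   # small stand-in for ~20,000 rows
toy <- data.frame(y = rnorm(n), x = rnorm(n))

# Hypothetical statistic: correlation between the two columns
corStat <- function(d, i) cor(d$y[i], d$x[i])

# Deliberately fewer replicates than rows
b <- boot(toy, corStat, R = 1000)

# Jackknife influence values for the first (here, only) statistic
L <- empinf(boot.out = b, index = 1, type = "jack")

# Supply them to boot.ci() so it does not have to estimate them itself
ci <- boot.ci(b, conf = 0.95, type = "bca", index = 1, L = L)
ci$bca   # confidence level, two order statistics, lower and upper endpoints
```

With 16 statistics, as in the question, the same two calls would be repeated with `index` running from 1 to 16, at a cost of n extra evaluations of the statistic per index.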

0 Answers