
I have been running bootstraps in R that use rbinom inside for loops, but they take too long to run.

I want to perform the bootstrap on a dataset with 1,500,000 rows.

I want to resample the rows and for each of the resampled rows

  1. use rbinom to turn two probability columns ('prob1' & 'prob2') into 0/1 Bernoulli outcomes ('prob1_ber' & 'prob2_ber')
  2. add a new column 'paired' with the combined outcome of step 1
  3. use rbinom to turn each unique combination of the columns 'paired' and 'Positive' into a 0/1 outcome ('prob_final')
  4. calculate 'pair_FPR' and 'pair_TPR'

This is what my code looks like:

library(boot)

#making example data
set.seed(1)
d2 <- data.frame(prob1=runif(n=1500000, min=1e-50, max=.9999999999),
                 prob2=runif(n=1500000, min=1e-44, max=.9999999989),
                 Positive=sample(c(0,1), replace=TRUE, size=1500000))

#making bootstrap function
function_1 <- function(data, i){
  d2<-data[i,]
  
  d2$prob1_ber <- rbinom(nrow(d2), 1, d2$prob1) #bernoulli 1 or 0
  d2$prob2_ber <- rbinom(nrow(d2), 1, d2$prob2) #bernoulli 1 or 0
  
  d2$paired <- ifelse(d2$prob1_ber == 1 & d2$prob2_ber == 1, '11',
                      ifelse(d2$prob1_ber == 0 & d2$prob2_ber ==0, '00',
                             ifelse(d2$prob1_ber == 1 & d2$prob2_ber ==0, '10',
                                    ifelse(d2$prob1_ber == 0 & d2$prob2_ber ==1, '01', NA)))) 
  
  d2$prob_final <- ifelse(d2$paired == '00',d2$prob1_ber, NA) #if both negative then negative
  
  for (i in which(d2$paired =='11' & d2$Positive==1)) {
    d2$prob_final[i] <- rbinom(1,1,0.9)
  }
  for (i in which(d2$paired =='11' & d2$Positive==0)) {
    d2$prob_final[i] <- rbinom(1,1,0.5)
  }
  for (i in which(d2$paired =='01' & d2$Positive==1)) {
    d2$prob_final[i] <- rbinom(1,1,0.8)
  }
  for (i in which(d2$paired =='01' & d2$Positive==0)) {
    d2$prob_final[i] <- rbinom(1,1,0.1)
  }
  for (i in which(d2$paired =='10' & d2$Positive==1)) {
    d2$prob_final[i] <- rbinom(1,1,0.7)
  }
  for (i in which(d2$paired =='10' & d2$Positive==0)) {
    d2$prob_final[i] <- rbinom(1,1,0.2)
  }
  
  pair_FPR <- sum(d2[which(d2$Positive==0),]$prob_final) / nrow(d2[which(d2$Positive==0),])*100
  
  pair_TPR <- sum(d2[which(d2$Positive==1),]$prob_final) / nrow(d2[which(d2$Positive==1),])*100
  
  return(c(pair_FPR, pair_TPR))
}


set.seed(1)
boot_out <- boot(d2, function_1, 1000)
print(boot_out)

This bootstrap takes too long to run (n=1000). Is there a way to make it faster?

Many thanks!

  • `Error in NROW(data) : object 'd2' not found`. Also, you forgot to include `library(boot)`. Please help us to help you by making your example [reproducible](https://stackoverflow.com/help/minimal-reproducible-example). – Limey Dec 22 '22 at 15:44
  • Thank you for your comment. I have now added `library(boot)` and an example dataset: `d2 <- data.frame(prob1=runif(n=150000000, min=1e-50, max=.9999999999), prob2=runif(n=150000000, min=1e-44, max=.9999999989), Positive=sample(c(0,1), replace=TRUE, size=150000000))`. Hope someone can help me :) – biomed_stats Dec 22 '22 at 16:18
  • `n=150000000` sets `n` to 150 million, not 1.5 million. Neither are "minimal"... – Limey Dec 22 '22 at 16:33
  • You are right, I changed it to n=1500000 now. Any idea on how to make it faster? – biomed_stats Dec 22 '22 at 16:47

1 Answer


There's a good reason for the old saw about "if you're using R and thinking of using a for loop, there's probably a better way to do it". I think this is a case in point.

You've given no context or description of your overall objective and I've not taken the time to understand your code. I'm also confused about why you've taken advantage of R's vectorisation in some places, but not in others.

Also, I think the use of the boot library is a red herring. What matters is the underlying performance of your function function_1. Finally, I don't think there's any need to generate 150,000,000 observations - or even 1,500,000 - to investigate underlying performance.
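
For instance, you can gauge the cost of a single bootstrap replicate at a small N before scaling up. This is just a sketch; d_small is a name I am introducing here, not part of your code.

#time one replicate on a small sample first (sketch; 'd_small' is
#an illustrative name, not part of the original code)
N <- 15000
set.seed(1)
d_small <- data.frame(prob1=runif(n=N, min=1e-50, max=.9999999999),
                      prob2=runif(n=N, min=1e-44, max=.9999999989),
                      Positive=sample(c(0,1), replace=TRUE, size=N))
system.time(function_1(d_small, sample(N, replace=TRUE)))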

Hence, my attempt to improve your function is:

library(dplyr)

function_2 <- function(data, i){
  d2 <- data[i,] %>% 
    mutate(
      prob1_ber = rbinom(nrow(.), 1, prob1), #bernoulli 1 or 0
      prob2_ber = rbinom(nrow(.), 1, prob2), #bernoulli 1 or 0
      paired = ifelse(prob1_ber == 1 & prob2_ber == 1, '11',
                      ifelse(prob1_ber == 0 & prob2_ber == 0, '00',
                             ifelse(prob1_ber == 1 & prob2_ber == 0, '10',
                                    ifelse(prob1_ber == 0 & prob2_ber == 1, '01', NA)))),
      prob_final = case_when(
        paired == '00' ~ prob1_ber,
        paired == '11' & Positive == 1 ~ rbinom(1, 1, 0.9),
        paired == '11' & Positive == 0 ~ rbinom(1, 1, 0.5),
        paired == '01' & Positive == 1 ~ rbinom(1, 1, 0.8),
        paired == '01' & Positive == 0 ~ rbinom(1, 1, 0.1),
        paired == '10' & Positive == 1 ~ rbinom(1, 1, 0.7),
        paired == '10' & Positive == 0 ~ rbinom(1, 1, 0.2)
      )
    )
  
  pair_FPR <- sum(d2[which(d2$Positive==0),]$prob_final) / nrow(d2[which(d2$Positive==0),])*100
  
  pair_TPR <- sum(d2[which(d2$Positive==1),]$prob_final) / nrow(d2[which(d2$Positive==1),])*100
  
  return(c(pair_FPR, pair_TPR))
}

And my test data is

N <- 15000
#making example data
set.seed(1)
d2 <- data.frame(prob1=runif(n=N, min=1e-50, max=.9999999999),
                 prob2=runif(n=N, min=1e-44, max=.9999999989),
                 Positive=sample(c(0,1), replace=TRUE, size=N))

Note that the results of function_1(d2, i) will not be the same as the results of function_2(d2, i) because of the order in which random numbers are generated. (function_2 works from row 1 to row n in order; function_1 works through rows in groups defined by `paired` and `Positive`.) However, I believe the distributional properties of the two functions will be the same.

So, to compare performance...

library(microbenchmark)

i <- sample(nrow(d2), replace=TRUE) #one bootstrap resample of row indices
microbenchmark(
  "f1" = function_1(d2, i), 
  "f2" = function_2(d2, i), 
  times = 10
)
Unit: nanoseconds
 expr min lq mean median uq max neval
   f1   7  9 28.7      9  9 203    10
   f2   8  9  8.9      9  9  10    10

A mean relative reduction in execution time of 100 * (28.7 - 8.9) / 28.7 = 69%. Relative performance may well depend on N, but I expect the benefit to increase with N because the benefits of vectorisation over looping should increase as N increases.

Bear in mind that using the tidyverse, whilst giving code that is usually easy to read and maintain, does not usually give the fastest execution times. data.table and base R are usually superior.
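
To illustrate the base-R route, here is a sketch of the same statistic with no package dependency. It replaces the per-row loops with a named lookup vector of probabilities indexed by the paired/Positive combination, and it draws one Bernoulli per row. function_3 and p_lookup are names introduced for this sketch, and I haven't validated it against your data.

#a base-R sketch of the same statistic; 'function_3' and 'p_lookup'
#are illustrative names, not part of the original code
function_3 <- function(data, i){
  d <- data[i,]
  ber1 <- rbinom(nrow(d), 1, d$prob1)
  ber2 <- rbinom(nrow(d), 1, d$prob2)
  key <- paste0(ber1, ber2, d$Positive)  #e.g. '110' = paired '11', Positive 0
  p_lookup <- c('000'=0,   '001'=0,      #both negative -> negative
                '110'=0.5, '111'=0.9,
                '010'=0.1, '011'=0.8,
                '100'=0.2, '101'=0.7)
  prob_final <- rbinom(nrow(d), 1, p_lookup[key]) #one draw per row
  c(100 * mean(prob_final[d$Positive == 0]),      #pair_FPR
    100 * mean(prob_final[d$Positive == 1]))      #pair_TPR
}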

I leave it to others to improve on my effort. I'm sure it can be done.

Limey
  • Thanks a lot for your efforts! I agree that the for loops make the code slow. The difference between function_1 and function_2 is important. I aim to give each paired group (paired & Positive) a certain percentage of 0's and 1's. So for paired=01 & Positive=1, rbinom with probability 0.8 should give about 80% of the rows 1's and 20% 0's. With function_2 this doesn't happen: using your example data, paired=01 & Positive=1 gives 1's for prob_final in all rows, which is not what I would like to have. Maybe you know another way to give each group a certain % of 0's and 1's, now that you understand the aim more? Many thanks – biomed_stats Dec 23 '22 at 11:15
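
One way to get what this comment asks for (a sketch, not code from the thread, assuming dplyr): draw one Bernoulli per row inside case_when via n(), so each (paired, Positive) group receives roughly its target share of 1's. case_when evaluates each right-hand side over all rows and selects element-wise, so every row gets its own draw. draw_final is an illustrative name introduced here.

library(dplyr)

#sketch: one draw per row, so paired=01 & Positive=1 gets ~80% 1's;
#'draw_final' is an illustrative name, not from the thread
draw_final <- function(d){
  d %>% 
    mutate(prob_final = case_when(
      paired == '00' ~ 0L, #both negative -> negative
      paired == '11' & Positive == 1 ~ rbinom(n(), 1, 0.9),
      paired == '11' & Positive == 0 ~ rbinom(n(), 1, 0.5),
      paired == '01' & Positive == 1 ~ rbinom(n(), 1, 0.8),
      paired == '01' & Positive == 0 ~ rbinom(n(), 1, 0.1),
      paired == '10' & Positive == 1 ~ rbinom(n(), 1, 0.7),
      paired == '10' & Positive == 0 ~ rbinom(n(), 1, 0.2)
    ))
}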