With an example dataframe pay, I am bootstrapping using base R. The main difference from classical bootstrapping is that each sampled ID can have multiple rows, all of which must be included.

There are 4 unique IDs in pay, hence my goal is to draw a sample of 4 IDs with replacement and create a new dataset resample containing all rows of the sampled IDs.

My code currently works but is inefficient given one million rows in my data and many repetitions required by bootstrap.

Creating pay:

ID    <- c(1, 1, 1, 2, 3, 3, 4, 4, 4, 4)
level <- 1:10
pay   <- data.frame(ID = ID, level = level)

My (inefficient) code for creating a single resampled dataset:

IDs <- levels(as.factor(ID))
samp <- sample(IDs, length(IDs), replace = TRUE)
resample <- numeric(0)

# Append all rows of each sampled ID; resample grows on every iteration
for (i in seq_along(samp)) {
  temp <- pay[pay$ID == samp[i], ]
  resample <- rbind(resample, temp)
}

Result:

> samp
[1] "1" "2" "3" "1"

> resample
  ID level
1  1     1
2  1     2
3  1     3
4  2     4
5  3     5
6  3     6
7  1     1
8  1     2
9  1     3

I think the slowest part is extending resample with every iteration. However, I do not know how many rows there will be at the end. Thanks a lot for your help.

Dudelstein

1 Answer

You can sample the rows by doing

pay[sample(seq_len(nrow(pay)), replace=TRUE),]

It seems fairly efficient.

> system.time({
+   for (i in 1:10000)
+     pay[sample(seq_len(nrow(pay)), replace=TRUE),]
+ })
   user  system elapsed
  0.469   0.002   0.473

Edit:

Per Dudelstein's comment below, the above is incorrect: it samples individual rows rather than whole IDs. Here's a way to address what I think you're asking for.

samp <- sample(unique(ID), replace=TRUE)
do.call(rbind, lapply(samp, function(x) pay[pay$ID == x,]))

In my benchmarks, it is roughly a third faster than the original method. I'm sure there's a better way.
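
One possible alternative, offered as a sketch rather than a tested solution: compute the row indices belonging to each ID once with split(), sample the ID labels, and subset pay a single time, so nothing is grown inside a loop and pay$ID is not rescanned for every sampled ID. The helper name resample_ids is hypothetical (it is not from the post), and this has not been timed on the million-row data.

```r
# Example data from the question
ID    <- c(1, 1, 1, 2, 3, 3, 4, 4, 4, 4)
level <- 1:10
pay   <- data.frame(ID = ID, level = level)

# Row indices of each ID, computed once and reusable for every replicate
rows_by_id <- split(seq_len(nrow(pay)), pay$ID)

# resample_ids() is a hypothetical helper name, not part of base R
resample_ids <- function(data, rows_by_id) {
  # Draw ID labels with replacement, one draw per unique ID
  samp <- sample(names(rows_by_id), length(rows_by_id), replace = TRUE)
  # Collect all row indices of every sampled ID, then subset once
  idx <- unlist(rows_by_id[samp], use.names = FALSE)
  data[idx, , drop = FALSE]
}

set.seed(1)  # only to make the example reproducible
resample <- resample_ids(pay, rows_by_id)
```

Because rows_by_id depends only on pay, it can be built once and passed to every bootstrap repetition, for example replicate(1000, mean(resample_ids(pay, rows_by_id)$level)) for a bootstrap of the mean level.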

Josh
  • My problem is that if ID = 1 is chosen in the sample, then all the rows where ID = 1 must be included in resample. This is not necessarily the case in your code. Sorry if this wasn't clear in the question. – Dudelstein Jul 29 '15 at 10:21
  • Thanks so much for your response. The code you wrote is noticeably more efficient, although it still takes about 30 minutes to resample my (rather large) dataset once. – Dudelstein Jul 30 '15 at 13:08