How to efficiently bootstrap groups (multilevel) using R

Question

I am analyzing a study which contains 40 individuals, each rating 10 vignettes.

indiv     vign      score    score2    gender    
  1         1         5         3        1
  1         2         2         4        1   
  1         3         8         1        1
  .         .         .         .        .
  .         .         .         .        .
  .         .         .         .        .
  39       10         9         1        1 
  40        8         1         5        0 
  40        9         3         8        0

I want to bootstrap the data, but I soon realized that it does not make sense to sample individual vignette rows; we should sample persons instead (so each sampled person brings along around 10 rows).

The following function works, but it is the bottleneck for the next function. How can this be done more efficiently?

ResampleMultilevel <- function(data, groupvar) {
  n <- length(unique(data[,groupvar]))

  index <- sample(unique(data[, groupvar]), n, replace = TRUE)  # draw persons, not rows

  resampled <- NULL      # one of the issues is that we do not know 
                         # the size of the matrix yet, since it may vary. 
  for (i in 1:n) {
   resampled <- rbind(resampled, data[data[, groupvar] == index[i], ])
  }
  return(resampled)
}
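
One possible way to avoid growing `resampled` with `rbind` inside the loop is to work on row numbers instead: group the row numbers by person once, draw persons with replacement, and subset a single time. The sketch below assumes `data` is a matrix or a base data frame; `resample_groups` is a hypothetical name, not something from the question, and the example data only imitates the question's layout.

```r
# Hypothetical helper, not the asker's original function.
resample_groups <- function(data, groupvar) {
  groups <- data[, groupvar]
  # Row numbers belonging to each group, computed once.
  rows_by_group <- split(seq_len(nrow(data)), groups)
  # Draw group labels with replacement; every group is equally likely.
  drawn <- sample(names(rows_by_group), length(rows_by_group), replace = TRUE)
  # Concatenate the row numbers of every drawn group (repeats are kept)
  # and subset once.
  data[unlist(rows_by_group[drawn], use.names = FALSE), , drop = FALSE]
}

set.seed(1)
# Assumed example data: 40 persons with 10 vignettes each.
a <- cbind(rep(1:40, each = 10), rep(1:10, 4), rnorm(400), rnorm(400))
boot <- resample_groups(a, 1)
dim(boot)
```

Because a person drawn twice contributes their rows twice, the result has as many rows as the drawn groups have in total, which here is always 400 since every person has exactly 10 rows.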

The issue with subset is that I couldn't find a way to keep duplicates.

a <- cbind(rep(1:40, each = 10), rep(1:10, 4), rnorm(40), rnorm(40))

index <- c(1,1)

subset(a, a[,1] == index)

Example data: `cbind(1:40, rep(1:10, 4), rnorm(40), rnorm(40))` — PascalVKooten, Mar 11 '13 at 23:26
What are you currently using as the `groupvar` argument, `indiv` or `vign`? — Marius, Mar 11 '13 at 23:32
I think your for loop can be replaced with `data[index,]` . I think that will save a bit. — Seth, Mar 11 '13 at 23:32
@Seth, that doesn't work. You need to select around 10 vignettes for every number (person) in `index`. Do mind that there can also be duplicate people, which wouldn't be selected. — PascalVKooten, Mar 11 '13 at 23:37
the `createDataPartition` function in the `caret` package will generate bootstrapped samples based on factor levels. — Gary Weissman, Mar 12 '13 at 00:42

CHP · Answer 1 · 2013-03-12T07:13:07.043

Based on the comments, I am amending my answer.

a <- cbind(rep(1:40, each = 10), rep(1:10, 4), rnorm(40), rnorm(40))
index <- c(1, 1, 3, 4, 2)
a[a[, 1] %in% index, ]
##       [,1] [,2]        [,3]        [,4]
##  [1,]    1    1  0.28135473  0.47970116
##  [2,]    1    2 -0.12628982  0.34862899
##  [3,]    1    3 -0.41140740  1.30204100
##  [4,]    1    4 -0.61163593 -1.13354157
##  [5,]    1    5 -0.31538238  1.42701315
##  [6,]    1    6 -0.20403098  2.13989392
##  [7,]    1    7  0.37681973  0.65843232
##  [8,]    1    8 -0.94062165  0.97246212
##  [9,]    1    9  0.63377352 -0.48948273
## [10,]    1   10 -0.39817929 -1.03607028
## [11,]    2    1  0.54866153 -0.55127459
## [12,]    2    2  0.08410140  0.01457366
## [13,]    2    3 -1.19006851  1.33213116
## [14,]    2    4 -0.47210092  0.83369309
## [15,]    2    5  0.75968678 -0.48212390
## [16,]    2    6 -1.00205770  0.56376027
## [17,]    2    7  0.67251644  0.07234657
## [18,]    2    8  0.73165780 -0.51483172
## [19,]    2    9 -0.26022238  2.33181762
## [20,]    2   10  0.03370091 -0.71427295
## [21,]    3    1  0.60810461  0.15054307
## [22,]    3    2 -1.29363706  1.30510127
## [23,]    3    3 -0.20479713 -2.39797975
## [24,]    3    4 -0.86927664 -0.10845738
## [25,]    3    5  0.89040130 -0.08459249
## [26,]    3    6 -0.21511823  1.33960644
## [27,]    3    7 -0.32413278 -0.31691484
## [28,]    3    8 -0.61545941 -0.10457591
## [29,]    3    9 -1.85072358  0.93267270
## [30,]    3   10  0.38456423  0.76231047
## [31,]    4    1  0.76016236  1.63854054
## [32,]    4    2 -0.94463491  1.87271085
## [33,]    4    3  1.62451250  1.63298961
## [34,]    4    4 -1.96908559  0.89058201
## [35,]    4    5  1.66755533  0.10288947
## [36,]    4    6 -0.02182803 -0.91358891
## [37,]    4    7 -0.09382921 -0.54950093
## [38,]    4    8  0.74597002  2.31924468
## [39,]    4    9  0.64732694  0.29681494
## [40,]    4   10 -0.66535049  1.81285111

Because we would want it to return way more. Remember, those 1,1,3,4,2 should be persons, each with around 10 vignettes attached. — PascalVKooten, Mar 12 '13 at 06:43
I see now my example is not valid... my mistake. Try this: `cbind(rep(1:40, each = 10), rep(1:10, 4), rnorm(40), rnorm(40))` — PascalVKooten, Mar 12 '13 at 06:46
`a[which(a[,1] == 2),]` works a bit, but now I would like to replace "2" with a vector of values for which it could be true! — PascalVKooten, Mar 12 '13 at 06:49
So you don't want same row to be repeated but all the rows whose first column value is in index, is that right? — CHP, Mar 12 '13 at 07:01
I think it does not allow duplicates!!! This is a problem, because I would like to be able to sample, and then allow persons to be sampled twice. — PascalVKooten, Mar 12 '13 at 08:08
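
The point about duplicates can be checked by counting rows. A small sketch, assuming the same example layout as in the comments (40 persons, 10 rows each): `%in%` only tests membership, so a person listed twice in `index` is still returned once, whereas looking up the rows of each drawn person in turn keeps the repeat.

```r
set.seed(42)
# Assumed example data: 40 persons with 10 vignettes each.
a <- cbind(rep(1:40, each = 10), rep(1:10, 4), rnorm(400), rnorm(400))
index <- c(1, 1, 3, 4, 2)

# Membership test: person 1 appears once, so 4 persons x 10 rows.
n_in <- nrow(a[a[, 1] %in% index, ])

# Per-draw lookup: person 1 appears twice, so 5 draws x 10 rows.
rows <- unlist(lapply(index, function(g) which(a[, 1] == g)))
n_dup <- nrow(a[rows, ])

c(n_in, n_dup)  # 40 50
```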

PascalVKooten · Answer 2 · 2013-03-12T07:00:11.953


index <- 5:10

This almost works, except that the structure is not really the matrix I would like it to be.

lapply(index, function(x) a[which(a[,1] == x),])

This also almost gets there, but it only works for a single number such as 2. A non-loop way to do the same for a whole vector would be great:

a[which(a[,1] == 2),]       # works
a[which(a[,1] == index), ]  # does not work
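
As far as I can tell, the second line fails because `==` compares `a[,1]` and `index` element by element, recycling the shorter vector, rather than asking which rows match any value in `index`. One way to get from the `lapply` result to a single matrix is to stack the pieces with `do.call(rbind, ...)`. This is only a sketch on assumed example data; `pieces` and `stacked` are names made up for the illustration.

```r
set.seed(7)
# Assumed example data: 40 persons with 10 vignettes each.
a <- cbind(rep(1:40, each = 10), rep(1:10, 4), rnorm(400), rnorm(400))
index <- 5:10

# One matrix per drawn person, kept in the order of `index`;
# a repeated value in `index` would simply yield a repeated piece.
pieces <- lapply(index, function(x) a[a[, 1] == x, , drop = FALSE])

# Stack the list into a single matrix with one rbind call.
stacked <- do.call(rbind, pieces)
dim(stacked)
```

This still loops over `index` inside `lapply`, but `rbind` is called once on the whole list instead of once per person.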

edited Mar 12 '13 at 07:00

answered Mar 12 '13 at 06:52
