In R with clustered data, how would you bootstrap at the cluster level and keep the same observation when clusters are chosen repeatedly?

Question

I am trying to bootstrap sample in R from a longitudinal dataset with multiple observations per person (i.e. data collected in multiple waves over time). So the data look like this:

id     wave   variable
101    1      15
101    2      17
101    3      18
102    1      13
102    2      14
102    3      14
103    1      13
103    2      15
103    3      17

What I would like to do is sample at the PERSON level and keep only one observation (wave) per person, randomly chosen, but keep the same observation if/when a person is sampled multiple times. So a bootstrap sample could look like this:

id     wave   variable
101    1      15
103    2      15
101    1      15

but never like this:

id     wave   variable
101    1      15
103    2      15
101    2      17

I'm struggling with how to code this at all, much less do it elegantly. Any thoughts would be much appreciated.

Allan Cameron · Answer 1 · 2020-02-26T20:31:24.470

You can first build a data frame with one randomly chosen row per ID, then sample that data frame with replacement (here `df` is the data from the question):

set.seed(69)
dfs <- split(df, df$id)
dfs <- mapply(function(x, y) x[sample(y,1),], dfs, sapply(dfs, nrow), SIMPLIFY = FALSE)
result <- do.call(rbind, dfs)
result[sample(nrow(result), 9, TRUE), ]
#>        id wave variable
#> 101   101    1       15
#> 103   103    2       15
#> 103.1 103    2       15
#> 103.2 103    2       15
#> 102   102    3       14
#> 101.1 101    1       15
#> 103.3 103    2       15
#> 102.1 102    3       14
#> 102.2 102    3       14

Created on 2020-02-26 by the reprex package (v0.3.0)
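If it helps, the two steps above can be wrapped in a small function so that one call gives one bootstrap replicate. This is only a sketch in base R: the name `boot_one_wave` and the `n_boot` argument are made up for illustration, and the default of drawing as many rows as there are ids is an assumption about what is wanted.

```r
# Hypothetical helper: one bootstrap replicate at the person level.
boot_one_wave <- function(data, n_boot = length(unique(data$id))) {
  # Step 1: pick the row index of one random wave for each id.
  rows_by_id <- split(seq_len(nrow(data)), data$id)
  picked <- vapply(rows_by_id,
                   function(r) r[sample.int(length(r), 1)],
                   integer(1))
  one_per_id <- data[picked, , drop = FALSE]
  # Step 2: resample persons with replacement; a repeated id
  # always carries the row chosen in step 1.
  idx <- sample.int(nrow(one_per_id), n_boot, replace = TRUE)
  out <- one_per_id[idx, , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Data from the question.
df <- data.frame(
  id = rep(101:103, each = 3),
  wave = rep(1:3, times = 3),
  variable = c(15, 17, 18, 13, 14, 14, 13, 15, 17)
)

set.seed(1)
b <- boot_one_wave(df)
# Number of distinct waves seen per sampled id (should be 1 for each).
waves_per_id <- tapply(b$wave, b$id, function(w) length(unique(w)))
```

Using `sample.int(length(r), 1)` on positions avoids the surprise that `sample(x, 1)` treats a single number `x` as `1:x`.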

score 0 · Answer 2 · answered Feb 26 '20 at 20:25

Your example:

x = structure(list(id = c(101L, 101L, 101L, 102L, 102L, 102L, 103L, 
103L, 103L), wave = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), variable = c(15L, 
17L, 18L, 13L, 14L, 14L, 13L, 15L, 17L)), class = "data.frame", row.names = c(NA, 
-9L))

Maybe something like this, if you don't mind dplyr:

library(dplyr)

set.seed(111)
x %>% group_by(id) %>% sample_n(1) %>%
  ungroup() %>% sample_n(n(), replace = TRUE)

# A tibble: 3 x 3
     id  wave variable
  <int> <int>    <int>
1   103     3       17
2   101     2       17
3   103     3       17

In the first line, you group by id and sample one row per group. Next you ungroup, so you are left with one row for each unique id. Then it's just a matter of sampling these rows with replacement. Hope I got it right.
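For an actual bootstrap you would repeat this many times. Below is a rough base R sketch of that loop. The helper name `one_replicate`, the choice of 200 replicates, and the mean as the statistic are all just for illustration, and redrawing the wave inside every replicate is my assumption about the intended design.

```r
# Data from the question.
x <- data.frame(
  id = rep(101:103, each = 3),
  wave = rep(1:3, times = 3),
  variable = c(15, 17, 18, 13, 14, 14, 13, 15, 17)
)

# Hypothetical helper: draw one wave per id, then resample ids
# with replacement (same number of rows as there are ids).
one_replicate <- function(data) {
  one_per_id <- do.call(rbind, lapply(split(data, data$id), function(d) {
    d[sample.int(nrow(d), 1), , drop = FALSE]
  }))
  one_per_id[sample.int(nrow(one_per_id), replace = TRUE), , drop = FALSE]
}

set.seed(111)
# Bootstrap distribution of the mean of `variable`.
boot_means <- replicate(200, mean(one_replicate(x)$variable))
# Percentile interval from the replicates.
ci <- quantile(boot_means, c(0.025, 0.975))
```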

dario · Accepted Answer · 2020-02-26T21:30:13.950

We could first sample one wave value for each id and then inner_join the result to the original data. Then we draw the bootstrap sample from this 'filtered' data...

Create larger data set to reproduce sampling:

set.seed(13)
df <- data.frame(id = rep(101:103, each=9),
                 wave = rep(1:3, times=9),
                 variable = sample(1:20,9*3, TRUE))

head(df)

   id wave variable
1 101    1        4
2 101    2        2
3 101    3        1
4 101    1       19
5 101    2       19
6 101    3       17

Solution using dplyr:

library(dplyr)

boot_size <- 1000

boot <- df %>% 
  inner_join(df %>% 
               group_by(id) %>% 
               sample_n(1) %>% 
               select(id, wave),
             by = c("id", "wave")) %>% 
  sample_n(boot_size, replace = TRUE)

Test if it worked:

head(boot)

   id wave variable
1 101    2        5
2 103    3        4
3 102    3       11
4 103    3        3
5 103    3        3
6 101    2        6

table(boot$id, boot$wave)

      2   3
101 323   0
102   0 353
103   0 324

Looks good: every id has values from only one wave.
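If you would rather avoid dplyr, the same idea (pick one wave per id, join back, resample) can be sketched in base R with `merge`. The intermediate names `picks` and `kept` are just for illustration, and the random draws will not match the dplyr output above.

```r
set.seed(13)
df <- data.frame(id = rep(101:103, each = 9),
                 wave = rep(1:3, times = 9),
                 variable = sample(1:20, 9 * 3, TRUE))

# One randomly chosen wave per id.
ids <- unique(df$id)
picks <- data.frame(
  id = ids,
  wave = vapply(ids, function(i) {
    w <- unique(df$wave[df$id == i])
    w[sample.int(length(w), 1)]
  }, integer(1))
)

# Keep only the rows that match the chosen id/wave pairs ...
kept <- merge(df, picks, by = c("id", "wave"))
# ... and draw the bootstrap sample from them.
boot <- kept[sample.int(nrow(kept), 1000, replace = TRUE), ]

# Each id should show up under exactly one wave.
tab <- table(boot$id, boot$wave)
```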

Edit:

I accidentally posted a working but very inefficient and stupid version of the solution, where my join data.frame selected from all combinations of id, wave AND variable. But we don't need all these combinations at this step. I replaced that line of code with a less stupid one. Sorry.
