generate bootstrap sample dependent on column

Question

I have a data set like this:

set.seed(1)
df <- data.frame(ID = rep(1:4, each = 3),
                 x = c(1,2,3,2,3,4,1,2,3,3,4,5),
                 V1 = rnorm(12))

> df
   ID x         V1
1   1 1 -0.6264538
2   1 2  0.1836433
3   1 3 -0.8356286
4   2 2  1.5952808
5   2 3  0.3295078
6   2 4 -0.8204684
7   3 1  0.4874291
8   3 2  0.7383247
9   3 3  0.5757814
10  4 3 -0.3053884
11  4 4  1.5117812
12  4 5  0.3898432

This example contains 4 individuals, identified by ID. Each individual is observed at the time points given in x. For example, ID 1 is observed at time points 1, 2 and 3.

In this example I have 2 observations at time point 1 (IDs 1 and 3) and 3 observations at time point 2 (IDs 1, 2 and 3).
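The counts per time point can be checked with base R's `table()`. This is a minimal sketch that rebuilds the same `df` as above; the name `counts` is only illustrative.

```r
set.seed(1)
df <- data.frame(ID = rep(1:4, each = 3),
                 x = c(1,2,3,2,3,4,1,2,3,3,4,5),
                 V1 = rnorm(12))

# Number of observations at each time point x
counts <- table(df$x)
counts

# Largest count over all time points; this is the target size per time point
max(counts)
```

Here time point 3 has the most observations, so it sets the number of rows every other time point would need after resampling.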

I now want a bootstrapped (sampled with replacement) data set that contains the same number of observations at each time point.

In this example the data set could look like this:

> df
   ID x         V1
1   1 1 -0.6264538
1   1 1 -0.6264538
2   1 2  0.1836433
2   1 2  0.1836433
3   1 3 -0.8356286
4   2 2  1.5952808
5   2 3  0.3295078
6   2 4 -0.8204684
6   2 4 -0.8204684
7   3 1  0.4874291
7   3 1  0.4874291
8   3 2  0.7383247
9   3 3  0.5757814
10  4 3 -0.3053884
11  4 4  1.5117812
11  4 4  1.5117812
12  4 5  0.3898432
12  4 5  0.3898432
12  4 5  0.3898432
12  4 5  0.3898432

This data set now has 4 observations at each time point.

score 2 · Accepted Answer · answered Mar 08 '19 at 08:02

We can first find the maximum number of times any value of x occurs, then use sample_n() within each x group with replace = TRUE so that every x ends up with the same number of rows.

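# Largest number of observations at any single time point
max_sample <- max(table(df$x))

library(dplyr)

df %>%
  group_by(x) %>%
  # Draw max_sample rows with replacement within each time point
  sample_n(max_sample, replace = TRUE) %>%
  arrange(x)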

#      ID     x     V1
#   <int> <dbl>  <dbl>
# 1     3     1  0.487
# 2     1     1 -0.626
# 3     1     1 -0.626
# 4     1     1 -0.626
# 5     3     2  0.738
# 6     2     2  1.60 
# 7     2     2  1.60 
# 8     3     2  0.738
# 9     4     3 -0.305
#10     2     3  0.330
#11     2     3  0.330
#12     4     3 -0.305
#13     4     4  1.51 
#14     4     4  1.51 
#15     4     4  1.51 
#16     4     4  1.51 
#17     4     5  0.390
#18     4     5  0.390
#19     4     5  0.390
#20     4     5  0.390
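In more recent dplyr releases `sample_n()` is marked as superseded by `slice_sample()`. The following is a sketch of the same idea with that function, assuming dplyr 1.0.0 or later is installed; the name `boot` is only illustrative, and the sampled rows will differ from the output above because the random draws are not the same.

```r
library(dplyr)

set.seed(1)
df <- data.frame(ID = rep(1:4, each = 3),
                 x = c(1,2,3,2,3,4,1,2,3,3,4,5),
                 V1 = rnorm(12))

# Largest number of observations at any single time point
max_sample <- max(table(df$x))

boot <- df %>%
  group_by(x) %>%
  # Resample each time point with replacement up to max_sample rows
  slice_sample(n = max_sample, replace = TRUE) %>%
  ungroup() %>%
  arrange(x)

# Check: every time point should now have max_sample rows
count(boot, x)
```

Calling `ungroup()` before further processing avoids carrying the grouping into later steps by accident.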

Thanks. I should add that `x` does not always start at 1; it ranges from -20 to +20 — spore234, Mar 08 '19 at 08:06
@spore234 It should not matter, I think, because `table` counts the frequency of each `x` value irrespective of what its value is. — Ronak Shah, Mar 08 '19 at 08:08
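A small check of that claim, using a made-up data frame `df2` whose time points include negative values (the data and the names `df2`, `max_sample2` and `boot2` are hypothetical and not from the question):

```r
library(dplyr)

set.seed(1)
# Hypothetical data: time points are negative, zero and positive
df2 <- data.frame(ID = 1:7,
                  x = c(-20, -20, -20, -5, 0, 0, 20),
                  V1 = rnorm(7))

# table() counts each distinct value of x, whatever its sign
max_sample2 <- max(table(df2$x))

boot2 <- df2 %>%
  group_by(x) %>%
  sample_n(max_sample2, replace = TRUE) %>%
  ungroup()

# Every time point, negative or not, has the same number of rows
table(boot2$x)
```

The grouping is done on the values of `x` themselves, so the approach does not depend on `x` starting at 1 or being positive.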
