I want to bootstrap a panel dataset manually. I need to cluster at the individual level to keep later manipulations consistent; that is, all observations for the same individual must be selected together in a bootstrap sample. To do this, I resample with replacement from the vector of unique individual IDs and use the result as the index.

df <- data.frame(ID = c("A","A","A","B","B","B","C","C","C"), v1 = c(3,1,2,4,2,2,5,6,9), v2 = c(1,0,0,0,1,1,0,1,0))

boot.index <- sample(unique(df$ID), replace = TRUE)

Then I select rows according to the index. Suppose boot.index = (B, B, C); I want a data frame like this:

ID v1 v2
B  4  0
B  2  1
B  2  1
B  4  0
B  2  1
B  2  1
C  5  0
C  6  1
C  9  0

Apparently df1 <- df[df$ID == boot.index,] does not give what I want. I tried subset and dplyr's filter, but neither works. Basically, this is an issue of selecting whole groups by a group index. Any suggestions? Thanks!

DXC

3 Answers

Using %in% to select the relevant rows would get your desired output.

> df
  ID v1 v2
1  A  3  1
2  A  1  0
3  A  2  0
4  B  4  0
5  B  2  1
6  B  2  1
7  C  5  0
8  C  6  1
9  C  9  0
> boot.index
[1] A B A
Levels: A B C
> df[df$ID %in% boot.index,]
  ID v1 v2
1  A  3  1
2  A  1  0
3  A  2  0
4  B  4  0
5  B  2  1
6  B  2  1

dplyr::filter based solution:

> library(dplyr)
> df %>% filter(ID %in% boot.index)
  ID v1 v2
1  A  3  1
2  A  1  0
3  A  2  0
4  B  4  0
5  B  2  1
6  B  2  1
amrrs
  • @amrrs Half done, but I still need group A to repeat after group B – DXC Oct 30 '17 at 14:33
  • @amrrs Yes, that's the point of the bootstrap: resampling from the sample, so some observations are meant to appear more than once. – DXC Oct 30 '17 at 14:34
  • That's based on the index, right? So you've got A twice and it's repeating, no? – amrrs Oct 30 '17 at 14:39
  • @amrrs I got A twice, so I need to select all the observations of A twice. – DXC Oct 30 '17 at 14:45
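
The point raised in these comments can be checked directly: %in% only tests membership, so an ID that appears twice in the index still selects its rows once. A minimal sketch, reusing df from the question with a hard-coded index (an assumption for illustration, not an actual sample() draw):

```r
df <- data.frame(ID = c("A","A","A","B","B","B","C","C","C"),
                 v1 = c(3,1,2,4,2,2,5,6,9),
                 v2 = c(1,0,0,0,1,1,0,1,0))

# Hard-coded for illustration; in practice this comes from sample()
boot.index <- c("A", "B", "A")

# %in% returns one TRUE/FALSE per row of df, so duplicates in
# boot.index cannot make a row appear more than once
in_sample <- df[df$ID %in% boot.index, ]
nrow(in_sample)   # 6 rows: A once, B once

# Number of rows a cluster bootstrap sample should contain:
# the group size of every drawn ID, duplicates included
expected_rows <- sum(table(df$ID)[boot.index])
expected_rows     # 9
```

The gap between the two row counts is exactly the repeated group that membership-based filtering drops.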
set.seed(42)
boot.index <- sample(unique(df$ID), replace = TRUE)
boot.index
#[1] C C A
#Levels: A B C

do.call(rbind, lapply(boot.index, function(x) df[df$ID == x,]))
#   ID v1 v2
#7   C  5  0
#8   C  6  1
#9   C  9  0
#71  C  5  0
#81  C  6  1
#91  C  9  0
#1   A  3  1
#2   A  1  0
#3   A  2  0
d.b
  • Order is not important, actually; whether it is `C C A` or `C A C` does not matter – DXC Oct 30 '17 at 14:39
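
For repeated bootstrap replicates, the same idea could be wrapped in a small function that collects row positions instead of binding sub-data-frames. This is only a sketch: cluster_boot_sample and its arguments are hypothetical names, not from any package, and it assumes the ID column is character or factor.

```r
df <- data.frame(ID = c("A","A","A","B","B","B","C","C","C"),
                 v1 = c(3,1,2,4,2,2,5,6,9),
                 v2 = c(1,0,0,0,1,1,0,1,0))

# Hypothetical helper: draw IDs with replacement, then collect the
# row positions of every drawn ID, duplicates included
cluster_boot_sample <- function(data, id_col) {
  ids   <- unique(as.character(data[[id_col]]))
  drawn <- sample(ids, size = length(ids), replace = TRUE)
  rows  <- unlist(lapply(drawn, function(id) which(data[[id_col]] == id)))
  data[rows, , drop = FALSE]
}

set.seed(1)
boot_df <- cluster_boot_sample(df, "ID")
nrow(boot_df)       # 9 here, because every ID has 3 rows
table(boot_df$ID)   # every count is a multiple of 3
```

Indexing once by row position avoids growing a data frame inside a loop, which may matter when the number of clusters is large.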

You can also do this with a join:

boot.index = c("B", "B", "C")
merge(data.frame(ID = boot.index), df, by = "ID", all.x = TRUE, all.y = FALSE)
ags29
  • @ags29 Thank you. Your answer is smart too, but I prefer manipulating the same data frame directly to creating a new one and merging, so I accepted d.b's answer. – DXC Oct 30 '17 at 14:48
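
To see what the join approach returns, here is a short check on the question's data. It is a sketch only; note that merge sorts the result by the key by default, so the row order may differ from the order of boot.index, which the asker said does not matter.

```r
df <- data.frame(ID = c("A","A","A","B","B","B","C","C","C"),
                 v1 = c(3,1,2,4,2,2,5,6,9),
                 v2 = c(1,0,0,0,1,1,0,1,0))

boot.index <- c("B", "B", "C")

# Each row of the left table matches all 3 rows of that ID in df,
# so an ID drawn twice contributes its rows twice
boot_df <- merge(data.frame(ID = boot.index), df, by = "ID", all.x = TRUE)

nrow(boot_df)            # 9
sum(boot_df$ID == "B")   # 6
sum(boot_df$ID == "C")   # 3
```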