R: select rows by group after resampling

Question

I want to bootstrap a panel dataset manually. I need to cluster at the individual level to keep later manipulations consistent; that is, all observations for the same individual must be selected together in the bootstrap sample. To do this, I resample with replacement from the vector of unique individual IDs and use the result as the index.

df <- data.frame(ID = c("A","A","A","B","B","B","C","C","C"), v1 = c(3,1,2,4,2,2,5,6,9), v2 = c(1,0,0,0,1,1,0,1,0))

boot.index <- sample(unique(df$ID), replace = TRUE)

Then I select rows according to the index. Suppose boot.index = (B, B, C); I want a data frame that contains all of B's rows twice and all of C's rows once.

Apparently df1 <- df[df$ID == boot.index,] does not give what I want. I tried subset and dplyr's filter, but nothing works. Basically, this is an issue of selecting whole groups by a group index. Any suggestions? Thanks!
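
For context on why the `==` attempt fails: when the two sides have different lengths, R recycles the shorter vector and compares position by position, rather than testing membership. A minimal sketch, assuming the draw (B, B, C) from above is hard-coded instead of sampled:

```r
ID <- c("A","A","A","B","B","B","C","C","C")

# Hard-coded draw for illustration (an assumption; sample() would vary)
boot.index <- c("B", "B", "C")

# boot.index is recycled to length 9: B B C B B C B B C,
# so each row is compared only with the element at its own position
recycled <- ID == boot.index
which(recycled)   # rows that happen to line up with the recycled draw
```

Only the rows whose ID coincides with the recycled value at the same position are kept, which is unrelated to selecting whole groups.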

score 0 · Answer 1 · answered Oct 30 '17 at 14:17

Using %in% to select the relevant rows should get your desired output.

> df
  ID v1 v2
1  A  3  1
2  A  1  0
3  A  2  0
4  B  4  0
5  B  2  1
6  B  2  1
7  C  5  0
8  C  6  1
9  C  9  0
> boot.index
[1] A B A
Levels: A B C
> df[df$ID %in% boot.index,]
  ID v1 v2
1  A  3  1
2  A  1  0
3  A  2  0
4  B  4  0
5  B  2  1
6  B  2  1

A dplyr::filter-based solution (requires library(dplyr)):

> df %>% filter(ID %in% boot.index)
  ID v1 v2
1  A  3  1
2  A  1  0
3  A  2  0
4  B  4  0
5  B  2  1
6  B  2  1
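
One caveat, raised in the comments below: `%in%` only tests membership, so an ID that was drawn twice still contributes its rows once. A minimal sketch of the difference, assuming the draw (A, B, A) shown above is hard-coded rather than sampled; `membership` and `expected_rows` are names introduced here for illustration:

```r
df <- data.frame(ID = c("A","A","A","B","B","B","C","C","C"),
                 v1 = c(3,1,2,4,2,2,5,6,9),
                 v2 = c(1,0,0,0,1,1,0,1,0))

# Hard-coded draw for illustration: A was sampled twice
boot.index <- c("A", "B", "A")

# %in% asks "does this row's ID occur anywhere in boot.index?",
# so the duplicate A collapses into a single selection
membership <- df[df$ID %in% boot.index, ]
nrow(membership)

# Rows a cluster bootstrap sample should contain:
# the group size of every drawn ID, duplicates included
expected_rows <- sum(table(df$ID)[boot.index])
expected_rows
```

The two counts differ whenever the draw contains a repeated ID, which is the usual case when sampling with replacement.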

amrrs

@amrrs Half done, but I still need group A to repeat after group B – DXC Oct 30 '17 at 14:33
@amrrs Yes, that's the point of the bootstrap: resampling from the sample, so some observations are meant to appear more than once. – DXC Oct 30 '17 at 14:34
That's based on the index, right? So you've got A twice and it's repeating, no? – amrrs Oct 30 '17 at 14:39
@amrrs I got A twice, so I need to select all the observations of A twice. – DXC Oct 30 '17 at 14:45

d.b · Accepted Answer · 2017-10-30T14:41:45.863

set.seed(42)
boot.index <- sample(unique(df$ID), replace = TRUE)
boot.index
#[1] C C A
#Levels: A B C

# For each drawn ID (duplicates included), take that ID's rows, then stack the pieces
do.call(rbind, lapply(boot.index, function(x) df[df$ID == x,]))
#   ID v1 v2
#7   C  5  0
#8   C  6  1
#9   C  9  0
#71  C  5  0
#81  C  6  1
#91  C  9  0
#1   A  3  1
#2   A  1  0
#3   A  2  0
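
A variation on the same idea, offered as a sketch rather than as part of the original answer: collect row numbers first and subset once, then tag each row with the draw it came from. The `draw` column is a hypothetical addition; it may help later per-individual manipulation, because the two copies of C would otherwise be indistinguishable when grouping by ID. The draw (C, C, A) is hard-coded here.

```r
df <- data.frame(ID = c("A","A","A","B","B","B","C","C","C"),
                 v1 = c(3,1,2,4,2,2,5,6,9),
                 v2 = c(1,0,0,0,1,1,0,1,0))

# Hard-coded draw for illustration, matching the output above
boot.index <- c("C", "C", "A")

# For each drawn ID, the row numbers of that individual (duplicates kept)
rows <- lapply(boot.index, function(id) which(df$ID == id))

# Subset once with the concatenated row numbers
boot.df <- df[unlist(rows), ]

# Hypothetical helper column: which draw (1, 2, 3, ...) each row belongs to
boot.df$draw <- rep(seq_along(rows), lengths(rows))

# Replace the mangled duplicate row names with 1..n
rownames(boot.df) <- NULL
boot.df
```

Grouping by `draw` instead of `ID` then treats each resampled copy of an individual as its own cluster.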

edited Oct 30 '17 at 14:41

answered Oct 30 '17 at 14:23

d.b

Order is not important, actually; whether it is `C C A` or `C A C` does not matter – DXC Oct 30 '17 at 14:39

score 0 · Answer 3 · answered Oct 30 '17 at 14:34

You can also do this with a join:

boot.index = c("B", "B", "C")
merge(data.frame("ID"=boot.index), df, by="ID", all.x=TRUE, all.y=FALSE)
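
To see what the join returns, here is a small self-contained check. It is a sketch using the question's data; `boot.df` is a name introduced here. Each copy of an ID in the left table matches all of that ID's rows in `df`, so duplicated draws are duplicated in the result.

```r
df <- data.frame(ID = c("A","A","A","B","B","B","C","C","C"),
                 v1 = c(3,1,2,4,2,2,5,6,9),
                 v2 = c(1,0,0,0,1,1,0,1,0))

boot.index <- c("B", "B", "C")

# Left join: one row per (drawn ID, matching observation) pair
boot.df <- merge(data.frame(ID = boot.index), df, by = "ID", all.x = TRUE)

nrow(boot.df)               # total rows in the bootstrap sample
sum(boot.df$ID == "B")      # B was drawn twice, so its 3 rows appear twice
sum(boot.df$ID == "C")      # C was drawn once
```

Note that merge() sorts the result on the key by default, so rows come back grouped by ID rather than in draw order.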

ags29

@ags29 Thank you. Your answer is smart too, but I prefer manipulating the same data frame directly to creating a new one and merging, so I accepted d.b's answer. – DXC Oct 30 '17 at 14:48
