When I create bootstrap samples from a data frame datta using the following code

boot1a <- replicate(3, do.call("rbind", lapply(sample(unique(datta$pid),2000,replace=TRUE), function(x) datta[datta$pid==x,])), simplify=FALSE)
boot1b <- data.frame(boot1a) # data frame from the list
sample1 <- boot1b[order(boot1b$pid),] # sorting based on pid and storing 

the variables in the bootstrap sample sample1 have names ending in .1, .2, .3, and so on. (pid is the person ID; it takes the same value for all observations of the same person.) For instance, with the above code a variable xy in datta gets the names xy, xy.1, and xy.2 for the first, second, and third bootstrap samples. I would prefer to have each bootstrap sample named differently, with the variable names in each remaining the same as in the original data frame. In the above case, I would like the bootstrap samples stored in three different data frames, say boot1, boot2, boot3, where the variable names in each data frame are the same as in the original data frame. I began doing it manually, one replication at a time, but it is going to take a lot of time to create many bootstrap samples. Does anybody have a suggestion on how to do this in a better way?

EDIT The first few observations for four of the many variables in the data frame datta are as follows.

    pid xy  zy  wy
     1  10  2   -5
     1  12  3   -4.5
     1  14  4   -4
     1  16  5   -3.5
     1  18  6   -3
     1  20  7   -2.5
     2  22  8   -2
     2  24  9   -1.5
     2  26  10  -1
     2  28  11  -0.5
     2  30  12  0
     2  32  13  0.5
Duna
  • I can't reproduce your code (using `mydata <- within(mtcars, pid <- mpg)`). I think you mean `boot1b <- do.call(rbind, boot1a)` in the second line. – Ferdinand.kraft Aug 26 '13 at 13:25
  • @Ferdinand.kraft Sorry if it is confusing, but the data frame `mydata` is my own data set not the one in `r`. – Duna Aug 26 '13 at 13:48
  • @Duna: Please provide the sample data for reproducibility – Metrics Aug 26 '13 at 14:10
  • @Metrics Sample data is provided now. In the sample data the value of `pid` is limited to 2, unlike 2000 in the original data frame. – Duna Aug 26 '13 at 14:31
  • Unfortunately, I was not able to replicate the results, but did post an answer if it is what you are looking after. – Metrics Aug 26 '13 at 17:18
  • @Metrics I appreciate what you did. Did you try this `pid = c(1,1,2,2,1,1,2,2,1,2); mm=c(pid, seq(1:10), seq(11:20)); m=matrix(mm, nrow=10, ncol=3, byrow=F); colnames(m)=c("pid", "x", "y"); m = data.frame(m); boota <- replicate(3, do.call("rbind", lapply(sample(unique(m$pid),2,replace=TRUE), function(x) m[m$pid==x,])), simplify=FALSE); bootb <- data.frame(boota) # data frame from the list; bootb` I am getting three re-samples with these codes. I would like to store each re-sample `r`in a separate variable `boot.r`. – Duna Aug 26 '13 at 17:36
  • No, you are not resampling. If you do `identical(boota[[1]],m) [1] TRUE` which means you are getting the same data `m`. In my answer, I was assuming that you have unique factor (like the pid) and you want to draw only one observation from each factor. Are you doing the same thing here? – Metrics Aug 26 '13 at 17:43
  • @Metrics If `pid=r` is in the bootstrap sample `boot.r` then I would like to have all of `r`'s observations in `boot.r`. Precisely, I want the indexing for bootstrapping to be based on person ID; i.e., if a person is selected in the bootstrap sample so shall all of the observations belonging to that person. – Duna Aug 26 '13 at 17:49
  • You are correct in that sense. But, I think you are introducing the bias in the data. I think your sampling should make sure that you have data for all pids's. – Metrics Aug 26 '13 at 18:00

1 Answer

Here is an example:

Data

set.seed(123)
data<-rnorm(100, 160, 20)
data1<-as.data.frame(matrix(data, nrow = 20, ncol = 5, byrow = FALSE))
n<-5
data2<-do.call("rbind", replicate(n, data1, simplify=FALSE))
data2$fac<-as.factor(rep(1:n,each=20))

Sampling

library(plyr)
# For each level of fac, draw one row position at random
sample1<-ddply(data2,.(fac),summarize, mysample=sample((1:length(fac)),size=1,replace=TRUE))
sample1
  fac mysample
1   1       18
2   2       14
3   3       13
4   4       20
5   5       14
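
For the per-person resampling described in the question and comments (if a pid is drawn, all of that person's rows come along), a minimal sketch follows. It assumes the .1, .2 suffixes come from calling data.frame() on the list of replicates, which binds them side by side as columns and makes the names unique; keeping the replicates in a named list avoids that. The helper name cluster_boot, its arguments, and the toy datta below are hypothetical, built only to mirror the sample data in the question.

```r
# Toy data shaped like the sample in the question: several rows per pid.
datta <- data.frame(
  pid = rep(1:2, each = 6),
  xy  = seq(10, 32, by = 2),
  zy  = 2:13,
  wy  = seq(-5, 0.5, by = 0.5)
)

# Hypothetical helper: draw person IDs with replacement and keep
# every row that belongs to each drawn person.
cluster_boot <- function(df, id = "pid", n_ids = length(unique(df[[id]]))) {
  u    <- unique(df[[id]])
  # Sample positions rather than values, so a single ID is not
  # misread by sample() as the range 1:ID.
  ids  <- u[sample.int(length(u), n_ids, replace = TRUE)]
  rows <- unlist(lapply(ids, function(i) which(df[[id]] == i)))
  out  <- df[rows, , drop = FALSE]
  out  <- out[order(out[[id]]), , drop = FALSE]  # sort by person ID
  rownames(out) <- NULL
  out
}

set.seed(1)
n_boot <- 3

# Keep the replicates as a list of data frames; do not wrap in data.frame().
boots <- replicate(n_boot, cluster_boot(datta), simplify = FALSE)
names(boots) <- paste0("boot", seq_len(n_boot))

names(boots$boot1)  # same column names as datta
```

Each replicate is then available as boots$boot1, boots$boot2, and so on, and lapply(boots, ...) can run the same analysis on all of them. If separate objects are really wanted, list2env(boots, envir = .GlobalEnv) would create boot1, boot2, boot3, though the list is usually easier to work with when there are many replicates.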
Metrics