1

I'm currently writing a tutorial about bootstrapping in R. I settled on the function boot in the boot package. I got the book "An introduction to the Bootstrap" by Efron/Tibshirani (1993) and just replicate a few of their examples.

Quite often in those examples, they compute statistics based on different samples. For instance, they have this one example where they have a sample of 16 mice. 7 of those mice received a treatment that was meant to prolong survival time after a test surgery. The remaining 9 mice did not receive the treatment. For each mouse, the number of days it survived was collected (values are given below).

Now, I want to use the bootstrapping approach to find out if the difference of mean is significant or not. However, if I understand the help page of boot correctly, I can't just pass two different samples with unequal sample size to the function. My workaround is as follows:

#Load package boot
library(boot)
#Read in the survival time in days for each mouse
treatment <- c(94, 197, 16, 38, 99, 141, 23)
control   <- c(52, 104, 146, 10, 51, 30, 40, 27, 46)
#Call boot twice(!)
b1 <- boot(data = treatment,
           statistic = function(x, i) {mean(x[i])},
           R = 10000)
b2 <- boot(data = control,
           statistic = function(x, i) {mean(x[i])},
           R = 10000)
#Compute difference of mean manually
mean_diff <- b1$t -b2$t

In my opinion, this solution is a bit of a hack. The statistic I'm interested in is now saved in a vector mean_diff, but I don't get all the great functionality of the boot package anymore. I can't call boot.ci on mean_diff, etc.

So my question basically is if my hack is the only way to do a bootstrap with the boot package in R and statistics that compare two different samples. Or is there another way?

I thought about passing one data.frame in with 16 rows and an additional column "Group":

df <- data.frame(survival=c(treatment, control), 
                 group=c(rep(1, length(treatment)), rep(2, length(control))))
head(df)
  survival group
1       94     1
2      197     1
3       16     1
4       38     1
5       99     1
6      141     1

However, now I would have to tell boot that it has to sample always 7 observations from the first 7 rows and 9 observations from the last 9 rows and treat these as separate samples. I would not know how to do that.

Christoph_J
  • 6,804
  • 8
  • 44
  • 58
  • Why aren't you using `t.test`????? – IRTFM Aug 15 '13 at 16:44
  • @Dwin I know that I can run `t.test(df$survival ~ df$group)` as an alternative to `boot`. However, this is not my question here (I actually have that part in my tutorial). The question is about the general case in which I want to apply a bootstrap to a statistic that compares two samples. The difference of mean test was just an example. Or did you have something in mind that combines `t.test` and `boot`? In that case, it would be great if you could share that solution because I don't quite see how. – Christoph_J Aug 15 '13 at 17:14
  • I was thinking you could use t.test(...)$t as your boot statistic if you could stratify the sampling properly. – IRTFM Aug 15 '13 at 19:46

3 Answers3

1

I've never really figured out what the big advantage of boot is, since it is so easy to manually code bootstrap procedures. You could try for example the following using replicate:

myboot1 <- function(){
    booty <- tapply(df$survival,df$group,FUN=function(x) sample(x,length(x),TRUE))
    sapply(booty,mean)
}
out1 <- replicate(1000,myboot1())

From this you can get a bunch of useful statistics quite easily:

rowMeans(out1) # group means
diff(rowMeans(out1)) # difference
mean(out1[1,]-out1[2,]) # another way of getting difference
apply(out1,1,quantile,c(0.025,0.975)) # treatment-group CIs
quantile(out1[1,]-out1[2,],c(0.025,0.975)) # CI for the difference
Thomas
  • 43,637
  • 12
  • 109
  • 140
  • As I wrote, `boot.ci` is pretty neat. Also, my "hack" doesn't look anymore complicated than yours, so it's not that `boot` makes things more complicated in the case with different samples (even if my hack is the best possible solution). But of course, you have a point. There are other solutions to run a bootstrap (like your nice one) that are not any more complicated than using `boot`. – Christoph_J Aug 15 '13 at 15:55
  • Actually, thanks to your answer I found a way to combine it with `boot`, so +1. Thanks! – Christoph_J Aug 15 '13 at 17:27
1

This is an example in ?boot.return:

diff.means <- function(d, f)
{    n <- nrow(d)
     gp1 <- 1:table(as.numeric(d$series))[1]
     m1 <- sum(d[gp1,1] * f[gp1])/sum(f[gp1])
     m2 <- sum(d[-gp1,1] * f[-gp1])/sum(f[-gp1])
     ss1 <- sum(d[gp1,1]^2 * f[gp1]) - (m1 *  m1 * sum(f[gp1]))
     ss2 <- sum(d[-gp1,1]^2 * f[-gp1]) - (m2 *  m2 * sum(f[-gp1]))
     c(m1 - m2, (ss1 + ss2)/(sum(f) - 2))
}
grav1 <- gravity[as.numeric(gravity[,2]) >= 7,]
boot(grav1, diff.means, R = 999, stype = "f", strata = grav1[,2])

Section3.2 of Davison and Hinkley can be referenced.

IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • I think that this example only works for samples of equal size, but it should be pretty straightforward to make it more general. I think one would just need a `gp2 <- seq(from=table(as.numeric(d$series))[1]+1,by=1, length.out=table(as.numeric(d$series))[2])` and adjust the rest accordingly. So thanks. However, this example made me wonder what `f` actually is. I thought it's something like `sample(1:26, 26, TRUE)`, but this example seems to indicate it's a vector of booleans. Now I'm a little confused...Tried to check out the source, but this looks too complicated for a Friday morning.... – Christoph_J Aug 16 '13 at 08:00
  • 1
    It can be a vector of numeric indices, logicals or weights. You use the minus sign on from of a vector of integers to remove items from a dataframe or matrix. Notice that if these were logicals, that the "-" would not be the correct method of negation. – IRTFM Aug 16 '13 at 16:04
0

Giving it another thought, I realized that I could actually combine Thomas' answer with boot. Here is a solution:

b <- boot(data=df, 
           statistic = function(x, i) {
             booty <- tapply(x$survival,x$group,FUN=function(x) sample(x,length(x),TRUE))
             diff(sapply(booty,mean))*-1
           },
           R=10000)

The trick is that the function you provide to the argument statistic has to accept a parameter i for the index, but that you completely ignore this parameter within your function. Instead, you do the sampling yourself. Of course, this is not the most efficient (because boot has to do the sampling as well), but I guess that in most cases this shouldn't be a big issue.

Christoph_J
  • 6,804
  • 8
  • 44
  • 58