I'm currently writing a tutorial about bootstrapping in R. I settled on the function boot in the boot package. I got the book "An Introduction to the Bootstrap" by Efron/Tibshirani (1993) and am replicating a few of their examples.
Quite often in those examples, they compute statistics based on different samples. For instance, they have this one example where they have a sample of 16 mice. 7 of those mice received a treatment that was meant to prolong survival time after a test surgery. The remaining 9 mice did not receive the treatment. For each mouse, the number of days it survived was collected (values are given below).
Now, I want to use the bootstrap approach to find out whether the difference in means is significant. However, if I understand the help page of boot correctly, I can't just pass two different samples with unequal sample sizes to the function. My workaround is as follows:
# Load package boot
library(boot)
# Read in the survival time in days for each mouse
treatment <- c(94, 197, 16, 38, 99, 141, 23)
control <- c(52, 104, 146, 10, 51, 30, 40, 27, 46)
# Call boot twice(!)
b1 <- boot(data = treatment,
           statistic = function(x, i) {mean(x[i])},
           R = 10000)
b2 <- boot(data = control,
           statistic = function(x, i) {mean(x[i])},
           R = 10000)
# Compute difference of means manually
mean_diff <- b1$t - b2$t
In my opinion, this solution is a bit of a hack. The statistic I'm interested in is now saved in a vector mean_diff, but I don't get all the great functionality of the boot package anymore. I can't call boot.ci on mean_diff, etc.
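To make the limitation concrete: with the two-call workaround, summaries have to be computed by hand from the vector of differences. The following is only a minimal sketch of that, assuming a plain percentile interval from quantile() is acceptable; its endpoints may differ slightly from what boot.ci would report for a boot object. The seed and the name ci are illustrative choices, not part of the original code.

```r
library(boot)

treatment <- c(94, 197, 16, 38, 99, 141, 23)
control <- c(52, 104, 146, 10, 51, 30, 40, 27, 46)

set.seed(1)  # arbitrary seed, only for reproducibility

# Two independent bootstraps, one per sample
b1 <- boot(data = treatment, statistic = function(x, i) mean(x[i]), R = 10000)
b2 <- boot(data = control, statistic = function(x, i) mean(x[i]), R = 10000)

# b1$t and b2$t are R x 1 matrices; flatten the difference to a plain vector
mean_diff <- as.vector(b1$t - b2$t)

# Hand-rolled 95% percentile interval and bootstrap standard error
ci <- quantile(mean_diff, probs = c(0.025, 0.975))
se <- sd(mean_diff)
ci
se
```

This works, but everything beyond the raw replicates (interval types, plots, bias estimates) has to be rebuilt manually.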
So my question basically is whether my hack is the only way to do a bootstrap with the boot package in R for statistics that compare two different samples. Or is there another way?
I thought about passing one data.frame in with 16 rows and an additional column "Group":
df <- data.frame(survival = c(treatment, control),
                 group = c(rep(1, length(treatment)), rep(2, length(control))))
head(df)
  survival group
1       94     1
2      197     1
3       16     1
4       38     1
5       99     1
6      141     1
However, now I would have to tell boot that it always has to sample 7 observations from the first 7 rows and 9 observations from the last 9 rows and treat these as separate samples. I don't know how to do that.
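One possible direction, sketched here rather than offered as the definitive answer: boot has a strata argument for multi-sample problems, which resamples indices within each stratum separately. Assuming the data frame layout above, passing the group column as strata should keep 7 treatment and 9 control observations in every resample. The function name mean_diff_stat and the seed are hypothetical choices made for this illustration.

```r
library(boot)

treatment <- c(94, 197, 16, 38, 99, 141, 23)
control <- c(52, 104, 146, 10, 51, 30, 40, 27, 46)

df <- data.frame(survival = c(treatment, control),
                 group = c(rep(1, length(treatment)), rep(2, length(control))))

# Statistic: difference in group means, computed on the resampled rows.
# With strata set, the indices i are drawn within each group.
mean_diff_stat <- function(d, i) {
  resampled <- d[i, ]
  mean(resampled$survival[resampled$group == 1]) -
    mean(resampled$survival[resampled$group == 2])
}

set.seed(1)  # arbitrary seed, only for reproducibility

b <- boot(data = df, statistic = mean_diff_stat, R = 10000,
          strata = df$group)

b$t0                        # observed difference in means, about 30.63
boot.ci(b, type = "perc")   # boot.ci now works on the result
```

Because the result is a regular boot object, the rest of the package's tooling applies to it directly, which is what the two-call workaround loses.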