0

I want to create a custom bootstrap function. The reason for this are several.

  • Better understanding (or let's just say understanding) of the process
  • To extrapolate bootstrap resampling elsewhere without dependancy of packages

I know there are some packages (mainly boot, rms, caret and others) that can help me with my issue and are pretty versatile but I want to be able to create a function by myself for the reasons stated above.

As far as my understanding goes, bootstrap is a resampling method consistent in taking n random samples from a sample (a dataframe in our case). Then using this n random samples to calculate estimates.

So, for example, say I fit a model (whatever, it really doesn't matter for my "example" code)

model <- coxph(Surv(time, cens)~groups, data=df)

I used survival because that's where I want to apply this right now, but because I'm interested on understanding what's really happening it really does not matter which model we choose.

Now, let's "resample". Theory-wise this is what I understand every time I read something about bootstrap

bstrap <- sample(df, 1000, replacement=T)
preds <- predict(model, bstrap)
mean(preds)
confint(preds) #This is probably the "faultiest" part, as C.I are supposed to be calculated by the bootstrap itself

Would something like this work? I can see some faulty things in there, but that's where my intuition on the topic drives me to think based on what I've read about bootstrap. Why wouldn't that work? Can it be something with the fact that I'm using exactly the same data I have fitted my model with? Is because the resampling is not so literal? Something else?

Many thanks!

jogo
  • 12,469
  • 11
  • 37
  • 42
N00b
  • 27
  • 1
  • 6

1 Answers1

1

I assume you want to bootstrap the predictions. Here is the basic implementation. (I use lm but it's the same with other models.)

mod <- lm(Sepal.Length ~ Petal.Length, data = iris)

preds <- predict(mod)

#bootstrap:

n <- 1000 #number of bootstrap resamples
bootpred <- matrix(ncol = length(preds), nrow = n)

set.seed(42) #for reproducibility

#loop over n
for (i in seq_len(n)) {
  bootdat <- iris[sample(n, replace = TRUE),] #bootstrap resample of data
  bootmod <- lm(Sepal.Length ~ Petal.Length, data = bootdat) #fit model to bootstrap resample
  bootpred[i,] <- predict(bootmod, newdata = iris) #calculate predictions from this model
}

CI <- apply(bootpred, 2, quantile, probs = c(0.025, 0.975)) #quantiles

plot(preds ~ Petal.Length, data = iris, pch = 16)
points(CI[1,] ~ Petal.Length, data = iris, col = "dark red", pch = 16)
points(CI[2,] ~ Petal.Length, data = iris, col = "dark red", pch = 16)

resulting plot with predictions and bootstrap CI

If you study this closely, you'll see that you were missing important steps, most importantly the loop and the refitting of the model.

Also, it is usually preferable to calculate a bias-corrected confidence interval.

And often it's better (more stable) to actually do residual bootstrapping instead of normal bootstrapping.

Roland
  • 127,288
  • 10
  • 191
  • 288
  • So, if I'm understanding this right, you are resampling, then fitting a model with that resampled observations and then fitting the whole sample with your "resampled model". What's the difference doing it this way compared to my resampling proposal? I can imagine it has something to do with fitting a bigger sample to the already "full" model. But if that's the case, then I can't use say a bootstrap = 1000 on a 400 observation model... Am I right? Thanks! – N00b Aug 01 '19 at 13:08
  • If you predict on resampled values without refitting the model, you get exactly the same prediction (at these values). What would be the purpose of that exercise? – Roland Aug 01 '19 at 13:20
  • Of course you can set `n <- 1000` if you have 400 rows in your input data.frame. – Roland Aug 01 '19 at 13:21
  • Yeah, you are right, I was focusing on using all my data to fit the model I wasn't really aware of the nonsense of fitting resampled "fitted" data. My bad. So, on the second part of my question: is this reliable? Is this a true bootstrap resampling? I mean, aren't we losing power fitting a model on a resampled set instead of our "original" sample? If someone would use this "bootstrap" instead of a pre-written one (for example of the packages I mentioned on the post), would that be ok, or is this just some sort of "plausible but not advisable" thing? – N00b Aug 01 '19 at 13:40
  • 1
    I suggest you go and do some reading. I can't really give you a statistics lecture via comments. My implementation here is equivalent to what the boot package does. Here I implemented [*case resampling*](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)#Case_resampling). As mentioned in the answer, I would usually do [*residual resampling*](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)#Resampling_residuals) for regression. I show how to do that with the boot package in this answer: https://stats.stackexchange.com/a/369613/11849 – Roland Aug 01 '19 at 13:47
  • "My implementation here is equivalent to what the boot package does", that's what I was asking, not for a statistics lecture. I will take your suggestion and read your answers though. Thanks! – N00b Aug 01 '19 at 13:51