4

I have been using SAS for a long time and now I would like to translate my codes in R. I need help to do the following:

  1. Generate several bootstrap samples
  2. Run a linear regression model on each of the samples
  3. Store the parameters in a new dataset by replicate samples

I edited this code for better clarity. I used a lot of for loops, which I know is not always recommended.This process is very slow

Is there a code/package (e.g. apply family functions, "caret" package) that can make this very clean efficient/fast especially as samplesize*bootsample >10 Million

Any help would be much appreciated.

samplesize <- 200
bootsize<- 500
myseed <- 123

#generating a fake dataset
       id=1:n
       set.seed(myseed)
       x <-  rnorm(samplesize, 5, 5)
       y <-  rnorm(samplesize, 2 + 0.4*x, 0.5)
      data <- data.frame(id, x, y)

head(data)
  id         x        y
1  1  2.197622 3.978454
2  2  3.849113 4.195852
3  3 12.793542 6.984844
4  4  5.352542 4.412614
5  5  5.646439 4.051405
6  6 13.575325 7.192007

# generate bootstrap samples

bootstrap <- function(nbootsamples, data, seed) {
   bootdata <-  data.frame() #to initialize it
   set.seed(seed) 
   for (i in 1:nbootsamples) {
     replicate <- i
     bootstrapIndex <- sample(1:nrow(data), replace = TRUE)
     datatemp <- data[bootstrapIndex, ]
     tempall <- cbind(replicate, datatemp)
     bootdata <- rbind(bootdata, tempall)
   }
   return(bootdata)
 }
 bootdata <- bootstrap(nbootsamples=bootsize, data=data, seed=myseed)
 bootdata <- dplyr::arrange(bootdata, replicate, id)
 head(bootdata)
 #The data should look like this
  replicate id         x        y
1         1  1  2.197622 3.978454
2         1  3 12.793542 6.984844
3         1  5  5.646439 4.051405
4         1  9  1.565736 3.451748
5         1 10  2.771690 3.081662
6         1 10  2.771690 3.081662

#Model-fitting and saving coefficient and means

modelFitting <- function(y, x, data) {
   modeltemp <-  glm(y ~ x,
     data = data,
     family = gaussian('identity'))
   Inty <-  coef(modeltemp)["(Intercept)"]
   betaX <-  coef(modeltemp)["x"]
   sdy <-  sd(residuals.glm(modeltemp))
   data.frame(Inty, betaX, sdy, row.names = NULL)
 }

saveParameters <- function(nbootsamples, data, seed) {
   parameters <-  data.frame() #to initialize it
   for (i in 1:length(unique(data$replicate))) {
     replicate <- i
     datai <- data[ which(data$replicate==i),]
     datatemp <- modelFitting(y, x,data=datai)
     meandata <- data.frame(Pr_X=mean(datai$x))
     tempall <- cbind(replicate, datatemp, meandata)
     parameters <- rbind(parameters, tempall)
   }
   return(parameters)
 }
parameters <- saveParameters(nbootsamples=bootsize, data=bootdata, seed=myseed)
 head(parameters)

#Ultimately all I want is my final dataset to look like the following

  replicate     Inty     betaX       sdy     Pr_X
1         1 2.135529 0.3851757 0.5162728 4.995836
2         2 1.957152 0.4094682 0.5071635 4.835884
3         3 2.044257 0.3989742 0.4734178 5.111185
4         4 2.093452 0.3861861 0.4921470 4.741299
5         5 2.017825 0.4037699 0.5240363 4.931793
6         6 2.026952 0.3979731 0.4898346 5.502320
SimRock
  • 229
  • 3
  • 10

1 Answers1

2

Regression with resampling is easily accomplished with the caret package. Given your example data, code to run 200 bootstrap samples through a generalized linear model looks like this.

library(caret)
x = round(rnorm(200, 5, 5))
y= rnorm(200, 2 + 0.4*x, 0.5)
theData <- data.frame(id=1:200,x, y)
# configure caret training parameters to 200 bootstrap samples
fitControl <- trainControl(method = "boot",
                           number = 200)
fit <- train(y ~ x, method="glm",data=theData,
             trControl = fitControl)
# print output object
fit
# print first 10 resamples 
fit$resample[1:10,]

Output from caret looks like this:

> fit
Generalized Linear Model 

200 samples
  1 predictor

No pre-processing
Resampling: Bootstrapped (200 reps) 
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
Resampling results:

  RMSE       Rsquared   MAE      
  0.4739306  0.9438834  0.3772199

> fit$resample[1:10,]
        RMSE  Rsquared       MAE    Resample
1  0.5069606 0.9520896 0.3872257 Resample001
2  0.4636029 0.9460214 0.3711900 Resample002
3  0.4446103 0.9549866 0.3435148 Resample003
4  0.4464119 0.9443726 0.3636947 Resample004
5  0.5193685 0.9191259 0.4010104 Resample005
6  0.4995917 0.9451417 0.4044659 Resample006
7  0.4347831 0.9494606 0.3383224 Resample007
8  0.4725041 0.9483434 0.3716319 Resample008
9  0.5295650 0.9458453 0.4241543 Resample009
10 0.4796985 0.9514595 0.3927207 Resample010
> 

Details about how to use caret including the contents of the resulting model objects (e.g. accessing the individual models so you can generate predictions with the predict() function for a simulation) are available at the caret GitHub site.

Caret also supports parallel processing. For an example of how to use parallel processing with caret, please read Improving Performance of Random Forest with caret::train().

Also, Monte Carlo simulations are supported in R through the Monte Carlo package in R.

Len Greski
  • 10,505
  • 2
  • 22
  • 33
  • Thank you @Len Greski. I edited the question and code for better clarity. I tried the `caret package` but I couldn't figure out how to get the desired result. Any advice.Thank you – SimRock Dec 07 '17 at 04:38