calculating mean squared prediction error based on simulated data in R

Question

I have following data. My goal is to calculate mean squared prediction error (MSPE)using cross validation by replicating 100 times.

y=rnorm(250,0,2)
    x1=rnorm(250,0,3)
    x2=rnorm(250,1,1)
    datasim=data.frame(y,x1,x2)

From this first i need to split the data in to training and test data. So i first calculated the indices using sample.int function in R. Based on those indices i split the data into training and test set.

    dd=replicate(100,sample.int(n = nrow(datasim), 
size = floor(.75*nrow(datasim)), replace = F))

          train_set=apply(dd,2,function(y)
          datasim[y, ])

        test_set=apply(dd,2,function(y)
          datasim[-y, ])

After that i have to use the training data to fit the model. And based on test data , i need to predict and obtain mean squared prediction error (MSPE). I don't know how to proceed from here. Especially i dont know how to link the training set and test set so that i can predict and calculate MSPE.

I tried it using this using a lapply function which is inside of another lapply function.

lapply(test_set, function(train_set) {
  lapply(train_set,function(x)
    mean((test_set$y- predict.lm(y ~ x1 + x2, data = train_set))^2)

}
  ))

But there seems to be a problem with this.Can anyone help me to figure out this ? Also is there any easier method to do this than this method ?

Thank you

RLave · Accepted Answer · 2018-12-12T08:12:10.417

To replicate remember that you need to pass a function.

This is something to get you going, also you should predict on test data, use newdata in the call to predict.

First, this function does all the part about splitting data and the model, notice that you can pass different perc, if you want to change that later.

sim_function <- function(datas, perc=0.75) {
  idx = sample(nrow(datas), floor(perc*nrow(datas)), replace = F) # sample idx

  train = datas[idx, ]
  test = datas[-idx, ]

  pred_lm = predict(lm(y~x1+x2,data=train), # model on train data
              newdata = test[, -1]) # predict on test data

  return(mean((test$y - pred_lm)^2)) # mse and return it
}

Now we can call replicate:

sim_rep <- replicate(100, sim_function(datasim)) # or sim_function(datasim, perc = 0.60) as an example
head(sim_rep)
[1] 4.664940 3.543390 3.119503 3.493320 4.182965 5.101870

Data:

set.seed(123) # always remember this when you simulate
y=rnorm(250,0,2)
x1=rnorm(250,0,3)
x2=rnorm(250,1,1)
datasim=data.frame(y,x1,x2)

calculating mean squared prediction error based on simulated data in R

1 Answers1