Good Afternoon.

I wanted a sanity check after doing some research on k-fold cross-validation. I will lay out my understanding, and then provide an example of how I would execute it in R.

I would really appreciate any feedback on whether I'm thinking about this incorrectly, or whether my code does not reflect my thought process / the correct procedure. Take a basic predictive modeling scenario with a continuous response variable:

  • I have a population dataset (xDF)
  • I want to split the dataset into k = 10 separate parts, train a model on 9 of them (combined), and then validate on the remaining fold
  • I then want to loop through the folds so that each one serves as the validation set exactly once, to observe how the model performs on data it was not trained on
  • If the model's performance measure (RMSE for this example) is similar across all 10 validation folds, that suggests the model is well-generalized

R Code:

library(randomForest)   # for randomForest()
library(caret)          # for RMSE()

#Declaring randomly sampled validation indices: a random permutation of the row numbers
ind <- sample(seq_len(nrow(xDF)))

# Assign the shuffled rows to k = 10 folds of (nearly) equal size;
# rep(1:k, length.out = nr) handles row counts that are not divisible by 10
nr <- nrow(xDF)
k <- 10
validation_ind <- split(ind, rep(1:k, length.out = nr))
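As a cross-check, I believe the same fold assignment can also be produced with caret's createFolds helper (a minimal sketch, assuming caret is loaded as above; validation_ind_alt is just an illustrative name):

# returnTrain = FALSE returns the held-out (validation) indices for each fold
validation_ind_alt <- createFolds(seq_len(nrow(xDF)), k = 10, returnTrain = FALSE)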

#Looping through validation sets to obtain the model performance measures of each fold
RMSEsF <- double(10)    # validation RMSE per fold
RMSEsFT <- double(10)   # training RMSE per fold
R2F <- double(10)       # validation R^2 per fold
R2FT <- double(10)      # training R^2 per fold
rsq <- function(x, y) cor(x, y)^2   # squared correlation as R^2
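# RMSE() used below is assumed to be caret's; defining an equivalent
# version explicitly (root mean squared error) keeps the script self-contained
RMSE <- function(pred, obs) sqrt(mean((pred - obs)^2))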

for (i in 1:10){

    # Hold out fold i for validation; train on the other 9 folds combined
    validate <- as.data.frame(xDF[validation_ind[[i]], ])
    train <- as.data.frame(xDF[unlist(validation_ind[-i]), ])

    # Fit a random forest on the 9 training folds
    rf_train <- randomForest(y ~ ., data = train, mtry = 3)

    # Predict on the held-out fold and, for comparison, on the training data
    # (note: in-sample predictions from randomForest use all trees, so the
    # training error will look optimistic compared to the OOB estimate)
    predictions_rf <- predict(rf_train, validate)
    predictions_rft <- predict(rf_train, train)

    # Record performance measures for this fold
    RMSEsF[i] <- RMSE(predictions_rf, validate$y)
    RMSEsFT[i] <- RMSE(predictions_rft, train$y)
    R2F[i] <- rsq(predictions_rf, validate$y)
    R2FT[i] <- rsq(predictions_rft, train$y)

    cat(".")   # simple progress indicator
}

RMSEsF
RMSEsFT
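To interpret these, I was planning to compare the spread of the per-fold validation RMSEs and the gap between validation and training error (a minimal sketch of the summary I have in mind):

# Similar per-fold validation RMSEs (small sd) and a modest gap between
# validation and training RMSE would suggest the model generalizes well
mean(RMSEsF); sd(RMSEsF)
mean(RMSEsFT); sd(RMSEsFT)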

Am I going about this correctly?

Many thanks in advance.

Kyle
