So my task is to break a dataframe of 506 observations into ten different samples of training and test sets (with replacement). I'm doing this so I can put it through a model and see the average MSE over ten samples. Thus far, I've got the following idiotically complicated for loop:

temp_train<- setNames(lapply(1:10, function(x) {x <-homeprices[sample(1:nrow(homeprices), 
.8*n, replace = FALSE), ]; x }), paste0("tr_sample.", 1:10))
for (i in 1:length(temp_train)) {
  assign(paste0("df_train_", i), as.data.frame(temp_train[i]))
  name<-assign(paste('df_train_', i, sep=''), x[i])
  temp_test<- setNames(homeprices[-name], paste0("te_sample.", 1:10))
  alpha<-assign(paste0("df_test_", i), as.data.frame(temp_test[i]))
}

This for loop produces, say, df_test_2, which is a data frame of 506 observations of one variable. It SHOULD be a data frame of 102 observations of 13 variables, namely the 102 observations that are NOT in df_train_2. My question therefore is: what's a better way to do this that actually works? I would prefer not to install any packages if possible, since I want to get a grasp of base R.

CapnShanty

1 Answer

A common (and efficient) strategy for handling this type of task in base R is not to create each individual data frame, but to simply create a set of indices that define the partition.

For example,

x <- replicate(n = 10, expr = sample(506, 404))

creates a matrix in which each of the ten columns holds the row indices of a random selection of 404 rows (roughly 80% of 506). Then, in your model-fitting loop, use each column of x to select the training subset of your data, and use negative indexing with the same column to get the corresponding 20% of rows for testing.
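
For instance, here is a minimal sketch of that loop, assuming homeprices is the 506-row data frame from your question and, purely as a placeholder, an lm() fit with medv as the response (substitute your own model and formula):

mse <- numeric(10)                       # one test-set MSE per split
for (i in 1:10) {
  train <- homeprices[x[, i], ]          # the 404 sampled rows
  test  <- homeprices[-x[, i], ]         # the 102 rows left out of this split
  fit   <- lm(medv ~ ., data = train)    # placeholder model/formula
  pred  <- predict(fit, newdata = test)
  mse[i] <- mean((test$medv - pred)^2)   # squared error on the held-out rows
}
mean(mse)                                # average MSE over the ten splits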

This way you don't have tons of copies of data frames lying about.

joran