
I need to backtest a predictive model in R by using cross-validation methodology.

So, I should select four-fifths of the observations in the dataset for training purposes and use the remaining fifth for testing.

Now, let's assume x is the total dataset, composed of 100 observations. I know that you can select a sub-sample x1 in R by typing:

x1 <- x[1:80, ]

In this way, I selected the first four-fifths of the observations in the dataset.

What should I do to select the second four-fifths sub-sample, that is, observations [1:20] and [41:100]?

Any hint will be appreciated. If the question turns out to be unclear, please ping me in the comments.

QuantumGorilla

3 Answers


The caret package has a lot of useful functions for predictive modeling. The createDataPartition function works well for creating test and training partitions, but it's random: there's no guarantee that every value will show up in the training/test sets exactly four times, as it would if you split manually using x[1:80], x[c(1:20,41:100)], x[c(1:40,61:100)], x[c(1:60,81:100)], x[21:100].
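
If you do want those exact deterministic folds, you can build them by hand; a minimal sketch using base R's split() (this is not part of caret):

# build the five consecutive test blocks: 1:20, 21:40, ..., 81:100
test_blocks <- split(1:100, rep(1:5, each = 20))
# the matching training indices are everything except each test block
train_indices <- lapply(test_blocks, function(idx) setdiff(1:100, idx))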

Here is an example using createDataPartition:

set.seed(1001)
x <- sample(1:1000, 100)

library(caret)
folds <- createDataPartition(x, times=5, p = 4/5) # p = percentage of data to include
                                                  # times = number of partitions

folds is a list of training-set indices into x, so you use it like this:

x[folds[[1]]] # first training set
x[-folds[[1]]] # first test set

x[folds[[2]]] # second training set
x[-folds[[2]]] # second test set

# and so on
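
If you instead want classic k-fold behaviour, where every observation appears in exactly one test set, caret also provides createFolds; a minimal sketch reusing the x defined above:

# createFolds partitions the data into k non-overlapping folds;
# returnTrain = FALSE returns the test-set indices
folds_cv <- createFolds(x, k = 5, returnTrain = FALSE)

x[-folds_cv[[1]]] # first training set
x[folds_cv[[1]]]  # first test set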
Jota

This is a typical task in machine learning. It is usually not recommended to take a consecutive set of data, like the first 80 out of 100 rows, since the data may have been collected in an ordered fashion and the remaining 20 rows (observations) could contain significantly different properties. The generally accepted solution is to take a random set (sample) of a pre-defined size from the total data, often somewhere between 70% and 80%, and use this as a training set while the remainder is the test set.

A simple way to achieve such a split of the data is to create a dummy index:

ind <- sample(2, nrow(x), replace = TRUE, prob = c(0.7, 0.3))

Then the training set and the test set can be separated easily:

train_data <- x[ind==1,]
test_data <- x[ind==2,]

Note that with this method the set is usually not split into exactly 70% and 30%. The training set may, e.g., represent 75% of the total data while the test set consists of the remaining 25%. In any case the entire set is split in two parts whose relative sizes roughly correspond to the proportions specified via the prob argument of the sample() function. Such fluctuations are acceptable for typical machine learning tasks, where the ratio of training set size to test set size does not need to be defined precisely.
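
If an exact 70/30 split is needed, a common alternative (a sketch, not taken from the answer above) is to sample row indices directly:

# sample exactly 70% of the row indices for training
n <- nrow(x)
train_idx <- sample(n, size = floor(0.7 * n))
train_data <- x[train_idx, ]
test_data <- x[-train_idx, ]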

Hope this helps.

RHertel

If you specifically wanted to leave out consecutive blocks of twenty, you could do something like this:

train_test_groups <- function(data, test_group, n_groups) {
  group_size <- nrow(data) %/% n_groups
  if (test_group == n_groups) {
    # last group makes up the numbers if the data don't split up evenly
    test_indices <- (group_size * (test_group - 1) + 1):nrow(data)
  } else {
    test_indices <- 1:group_size + group_size * (test_group - 1)
  }
  list(train = data[-test_indices, ],
       test = data[test_indices, ])
}

Example:

my_data <- data.frame(x = 1:100, y = rnorm(100))
first_groups <- train_test_groups(my_data, 1, 5)
first_groups$train
first_groups$test
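
To run the full cross-validation you could then loop over all five folds; a short usage sketch:

# build all five train/test pairs and check the test-set sizes
all_folds <- lapply(1:5, function(i) train_test_groups(my_data, i, 5))
sapply(all_folds, function(f) nrow(f$test)) # 20 rows in each test fold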
Nick Kennedy