Creating data partitions over a selected range of data to be fed into caret::train function for cross-validation

Question

I want to create jack-knife data partitions for the data frame below, to be used in caret::train (like those caret::groupKFold() produces). However, the catch is that I want to restrict the test points to, say, day 16 onwards, whilst using the remainder of the data as the training set.

df <- data.frame(Effect = seq(from = 0.05, to = 1, by = 0.05),
                 Time = 1:20)

The reason I want to do this is that I am only really interested in how well the model is predicting the upper bound, as this is the region of interest. I feel like there is a way to do this with the caret::groupKFold() function but I am not sure how. Any help would be greatly appreciated.

An example of what each CV fold would comprise:

TrainSet1 <- subset(df, Time != 16)
TestSet1 <- subset(df, Time == 16)

TrainSet2 <- subset(df, Time != 17)
TestSet2 <- subset(df, Time == 17)

TrainSet3 <- subset(df, Time != 18)
TestSet3 <- subset(df, Time == 18)

TrainSet4 <- subset(df, Time != 19)
TestSet4 <- subset(df, Time == 19)

TrainSet5 <- subset(df, Time != 20)
TestSet5 <- subset(df, Time == 20)

The folds should, however, be in the format that the caret::groupKFold function outputs, so that they can be fed into the caret::train function:

CVFolds <- caret::groupKFold(df$Time)
CVFolds

Thanks in advance!

It is not clear to me what exactly you would like done. Could you show an example of the test folds (the expected outcome) on the posted data? – missuse, Oct 16 '18 at 09:24
Sorry, and thank you for the advice! Please see the edit above! – André.B, Oct 17 '18 at 00:07

missuse · Accepted Answer · 2018-11-19T09:09:07.077

For customized folds I find that built-in functions are usually not flexible enough, so I usually produce them using tidyverse. One approach to your problem would be:

library(tidyverse)

df %>%
  mutate(id = row_number()) %>% #use the row number as a column called id
  filter(Time > 15) %>% #filter Time as per your need
  split(.$Time)  %>% #split df to a list by Time
  map(~ .x %>% select(id)) #select row numbers for each list element

Example with two rows per time point:

df <- data.frame(Effect = seq(from = 0.025, to = 1, by = 0.025),
                 Time = rep(1:20, each = 2))

df %>%
  mutate(id = row_number()) %>%
  filter(Time > 15) %>%
  split(.$Time)  %>%
  map(~ .x %>% select(id)) -> test_folds

test_folds
#output
$`16`
  id
1 31
2 32

$`17`
  id
3 33
4 34

$`18`
  id
5 35
6 36

$`19`
  id
7 37
8 38

$`20`
   id
9  39
10 40

With an unequal number of rows per time point:

df <- data.frame(Effect = seq(from = 0.55, to = 1, by = 0.05),
                 Time = c(rep(1, 5), rep(2, 3), rep(3, 2)))

df %>%
  mutate(id = row_number()) %>%
  filter(Time > 1) %>%
  split(.$Time)  %>%
  map(~ .x %>% select(id))

$`2`
  id
1  6
2  7
3  8

$`3`
  id
4  9
5 10

You can now pass these hold-out folds to trainControl through its indexOut argument.

EDIT: to get output similar to that of caret::groupKFold (an unnamed list of integer vectors), unlist each element:

df %>%
  mutate(id = row_number()) %>%
  filter(Time > 1) %>%
  split(.$Time)  %>%
  map(~ .x %>%
        select(id) %>%
        unlist %>%
        unname) %>%
  unname
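As a minimal sketch of how such hold-out folds might be wired into trainControl: as far as the caret documentation describes, indexOut is expected to be a list of the same length as index, so the matching training rows are built here as the complement of each hold-out fold. The names index_out, index_in and ctrl are illustrative and not part of the original answer, the folds are built with base R rather than tidyverse, and the trainControl call is only attempted if caret is installed.

```r
# Toy data from the question: one row per time point
df <- data.frame(Effect = seq(from = 0.05, to = 1, by = 0.05),
                 Time = 1:20)

# Held-out row numbers: one fold per time point above 15
test_rows <- which(df$Time > 15)
index_out <- unname(split(test_rows, df$Time[test_rows]))

# Matching training rows (assumption: train on every row not held out in that fold)
index_in <- lapply(index_out, function(held_out) {
  setdiff(seq_len(nrow(df)), held_out)
})

# Build the control object only if caret is available
if (requireNamespace("caret", quietly = TRUE)) {
  ctrl <- caret::trainControl(method = "cv",
                              index = index_in,
                              indexOut = index_out)
  # ctrl can then be supplied to caret::train(..., trControl = ctrl)
}
```

Supplying index alongside indexOut keeps each training set paired with its own hold-out fold, rather than relying on whatever resamples caret would otherwise generate.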

edited Nov 19 '18 at 09:09

answered Oct 17 '18 at 08:17

missuse

Hey missuse, thank you for the help! I have only just gotten back around to looking at this and have hit a slight problem. The output of the above code is a list of tibbles, each with a single integer column, but the trainControl function requires a list of integer vectors. I have had a play with it, but I am not very familiar with tidyverse and haven't been able to alter it to give the desired output. The required format is the one produced by the `caret::groupKFold(df$Time)` line. Thank you in advance! – André.B Nov 18 '18 at 22:16
You can just add an `unlist` to the map call. Check the edit. – missuse Nov 19 '18 at 09:08
Thank you again for the help @missuse! I have come up against another issue that I hadn't foreseen. If I have data structured like this: `df <- data.frame(Effect = rep(seq(from = 0.05, to = 1, by = 0.05), each = 5), Time = rep(seq(1:20), each = 5))`, is there a way to adapt your code to take each time point in the upper bound (say > 15) and create a fold out of each row? I.e. each row in the upper bracket gets used as the test set once, while all other data is used for training. – André.B Nov 19 '18 at 21:55
Something like leave-one-out CV, but only over a specified subset of the data? Yes, there is, but I think it would be better if you posted this as a separate question, since multiple sub-questions in comments often lead to an answer that is hard for others to interpret. – missuse Nov 20 '18 at 08:23
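The leave-one-out variant raised in these comments could be sketched roughly as follows. This is an illustrative base R sketch and not code posted in the thread: it assumes every row with Time > 15 becomes its own hold-out fold and all remaining rows form the training set, and the names index_out and index_in are hypothetical.

```r
# Data from the comment: five rows per time point
df <- data.frame(Effect = rep(seq(from = 0.05, to = 1, by = 0.05), each = 5),
                 Time = rep(1:20, each = 5))

# One hold-out fold per row in the upper range (Time > 15)
test_rows <- which(df$Time > 15)
index_out <- as.list(test_rows)

# Training rows for each fold: every row except the one held out
index_in <- lapply(index_out, function(held_out) {
  setdiff(seq_len(nrow(df)), held_out)
})

length(index_out)  # number of folds, one per row with Time > 15
```

The two lists are in the same list-of-integer-vectors shape that trainControl takes for its index and indexOut arguments.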
