
I want to create jackknife data partitions for the data frame below, to be used in caret::train (like those that caret::groupKFold() produces). The catch is that I want to restrict the test points to, say, day 16 or later, while using the remainder of the data as the training set.

df <- data.frame(Effect = seq(from = 0.05, to = 1, by = 0.05),
     Time = seq(1:20))

The reason I want to do this is that I am only really interested in how well the model is predicting the upper bound, as this is the region of interest. I feel like there is a way to do this with the caret::groupKFold() function but I am not sure how. Any help would be greatly appreciated.

An example of what each CV fold would comprise:

TrainSet1 <- subset(df, Time != 16)
TestSet1 <- subset(df, Time == 16)

TrainSet2 <- subset(df, Time != 17)
TestSet2 <- subset(df, Time == 17)

TrainSet3 <- subset(df, Time != 18)
TestSet3 <- subset(df, Time == 18)

TrainSet4 <- subset(df, Time != 19)
TestSet4 <- subset(df, Time == 19)

TrainSet5 <- subset(df, Time != 20)
TestSet5 <- subset(df, Time == 20)

The folds would need to be in the format that the caret::groupKFold function outputs, so that they can be fed into the caret::train function:

CVFolds <- caret::groupKFold(df$Time)
CVFolds

Thanks in advance!

André.B
  • It is not clear to me what exactly you would like done. Could you show an example of the test folds (the expected outcome) on the posted data? – missuse Oct 16 '18 at 09:24
  • Sorry and thank you for the advice! Please see the edit above! – André.B Oct 17 '18 at 00:07

1 Answer


For customized folds I find that built-in functions are usually not flexible enough, so I usually produce them using tidyverse. One approach to your problem would be:

library(tidyverse)

df %>%
  mutate(id = row_number()) %>% #use the row number as a column called id
  filter(Time > 15) %>% #filter Time as per your need
  split(.$Time)  %>% #split df to a list by Time
  map(~ .x %>% select(id)) #select row numbers for each list element

Example with two rows per time point:

df <- data.frame(Effect = seq(from = 0.025, to = 1, by = 0.025),
                 Time = rep(1:20, each = 2))

df %>%
  mutate(id = row_number()) %>%
  filter(Time > 15) %>%
  split(.$Time)  %>%
  map(~ .x %>% select(id)) -> test_folds

test_folds
#output
$`16`
  id
1 31
2 32

$`17`
  id
3 33
4 34

$`18`
  id
5 35
6 36

$`19`
  id
7 37
8 38

$`20`
   id
9  39
10 40

With an unequal number of rows per time point:

df <- data.frame(Effect = seq(from = 0.55, to = 1, by = 0.05),
                 Time = c(rep(1, 5), rep(2, 3), rep(3, 2)))

df %>%
  mutate(id = row_number()) %>%
  filter(Time > 1) %>%
  split(.$Time)  %>%
  map(~ .x %>% select(id))

$`2`
  id
1  6
2  7
3  8

$`3`
  id
4  9
5 10

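Now you can define these hold-out folds inside trainControl with the argument indexOut. The matching training rows for each fold (here, all rows not in that fold) go in the index argument.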

EDIT: to get output in the same format as caret::groupKFold (an unnamed list of integer vectors), one can:

df %>%
  mutate(id = row_number()) %>%
  filter(Time > 1) %>%
  split(.$Time)  %>%
  map(~ .x %>%
        select(id) %>%
        unlist %>%
        unname) %>%
  unname
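
To connect the hold-out folds to trainControl, the following is a minimal base R sketch rather than a definitive implementation. It assumes the toy data frame with unequal rows per time point, the Time > 1 threshold used above, and that every row outside a test fold should be used for training. The fold labels ("Time2", "Time3") and the variable names are illustrative only; the caret call is guarded so the fold construction runs even where caret is not installed.

```r
# Toy data from the answer: unequal numbers of rows per time point
df <- data.frame(Effect = seq(from = 0.55, to = 1, by = 0.05),
                 Time = c(rep(1, 5), rep(2, 3), rep(3, 2)))

# Rows eligible to be held out (assumed threshold: Time > 1)
upper <- which(df$Time > 1)

# One test fold per time point: a list of integer row indices
test_folds <- split(upper, df$Time[upper])

# Matching training folds: every row that is not in the test fold
train_folds <- lapply(test_folds,
                      function(idx) setdiff(seq_len(nrow(df)), idx))

# Illustrative fold labels (hypothetical naming scheme)
fold_names <- paste0("Time", names(test_folds))
names(test_folds) <- fold_names
names(train_folds) <- fold_names

# Pass both lists to trainControl: index holds the training rows,
# indexOut holds the held-out rows for the same resample
if (requireNamespace("caret", quietly = TRUE)) {
  ctrl <- caret::trainControl(method = "cv",
                              index = train_folds,
                              indexOut = test_folds)
}
```

The resulting ctrl object would then be supplied to caret::train through its trControl argument. Because each training fold here is simply the complement of its test fold, supplying index alone should give the same resamples, since caret holds out the rows not listed in index when indexOut is left NULL.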
missuse
  • Hey missuse, thank you for the help! I have only just gotten back around to looking at this stuff and have hit a slight problem. The output of the above code is a list of tibbles, each with a single integer column, but the trainControl function requires a list of integer vectors. I have had a play with it but I am not very familiar with tidyverse and haven't been able to alter it to give the desired output. The required format is shown in the `caret::groupKFold(df$Time)` line. Thank you in advance! – André.B Nov 18 '18 at 22:16
  • You can just add an `unlist` to the map call. Check the edit. – missuse Nov 19 '18 at 09:08
  • Thank you again for the help @missuse! I have come up on another issue that I hadn't foreseen - If I have data structured like this: `df <- data.frame(Effect = rep(seq(from = 0.05, to = 1, by = 0.05), each = 5), Time = rep(seq(1:20), each = 5))`. Is there a way to adapt your code to take each time point in the upper bound (say > 15) and create a fold out of each row? I.e. each time point in the upper bracket gets used as the test set once, while all other data is used for training. – André.B Nov 19 '18 at 21:55
  • Something like leave-one-out CV, but just with a specified subset of all of the data? Yes, there is, but I think it would be better if you posted this as a separate question, since multiple sub-questions in comments often lead to an answer that is hard for others to interpret. – missuse Nov 20 '18 at 08:23