Specifying a selected range of data to be used in leave-one-out (jack-knife) cross-validation for use in the caret::train function

Question

This question builds on one that I asked earlier: "Creating data partitions over a selected range of data to be fed into caret::train function for cross-validation".

The data I am working with looks like this:

df <- data.frame(Effect = rep(seq(from = 0.05, to = 1, by = 0.05), each = 5), Time = rep(c(1:20,1:20), each = 5), Replicate = c(1:5))

Essentially, I would like to create custom partitions like those generated by the caret::groupKFold function, but with the folds restricted to a specified range (i.e. Time > 15 days). Each fold should withhold a single point as the test set and use all other data for training. This would be repeated until every point in the specified range has been used as a test set once. In the question linked above, @Missuse wrote some code that gets close to this output.

I would try to show the desired output, but in all honesty the caret::groupKFold function's output confuses me, so hopefully the above description will suffice. Happy to try and clarify though!

You can proceed as in the linked answer, but instead of splitting by time, split by a dummy variable which is an integer sequence `1:n()`. If you are still having problems I can post an answer with code. – missuse, Nov 22 '18 at 07:09
I am not exactly sure how to implement that, and I think I may have been a little misleading with how the data was represented... I have just updated the question to have a more representative dataset. Sorry for any trouble this might have caused and thank you again for the help! – André.B, Nov 30 '18 at 03:00

missuse · Accepted Answer · 2018-12-18T07:35:20.357


Here is one way you could create the desired partition using tidyverse:

library(tidyverse)

df %>%
  mutate(id = row_number()) %>% #create a column called id which will hold the row numbers
  filter(Time > 15) %>% #subset data frame according to your description 
  split(.$id)  %>% #split the data frame into lists by id (row number)
  map(~ .x %>% select(id) %>% #clean up so it works with indexOut argument in trainControl
        unlist %>%
        unname) -> folds_cv

EDIT: it seems the indexOut argument does not perform as expected, but the index argument does. So after making folds_cv, one can get the inverse (the training rows for each fold) using setdiff:

folds_cv <- lapply(folds_cv, function(x) setdiff(1:nrow(df), x))

The folds can now be passed to trainControl:

test_control <- trainControl(index = folds_cv,
                             savePredictions = "final")


quad.lm2 <- train(Time ~ Effect,
                  data = df,
                  method = "lm",
                  trControl = test_control)

This runs, but with a warning:

Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.
> quad.lm2
Linear Regression 

200 samples
  1 predictor

No pre-processing
Resampling: Bootstrapped (50 reps) 
Summary of sample sizes: 199, 199, 199, 199, 199, 199, ... 
Resampling results:

  RMSE          Rsquared  MAE         
  3.552714e-16  NaN       3.552714e-16

Tuning parameter 'intercept' was held constant at a value of TRUE

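So each resample trained on 199 rows and predicted on the 1 remaining row, repeated for each of the 50 rows we wanted to hold out in turn. (The "Bootstrapped (50 reps)" label only reflects the default method in trainControl; the folds themselves come from index.) This can be verified in: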

quad.lm2$pred
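
The fold list can also be checked without caret. The following is a minimal base R sketch, not the code from the answer: it rebuilds the example data and constructs the training folds with which() and setdiff() as a stand-in for the tidyverse pipeline, then confirms the fold sizes and the held-out rows. The names held_out, fold_sizes and omitted are introduced here for illustration only.

```r
# Rebuild the example data from the question.
df <- data.frame(Effect = rep(seq(from = 0.05, to = 1, by = 0.05), each = 5),
                 Time = rep(c(1:20, 1:20), each = 5),
                 Replicate = c(1:5))

# Base R stand-in for the pipeline: one held-out row per fold, Time > 15 only.
held_out <- which(df$Time > 15)
folds_cv <- lapply(held_out, function(i) setdiff(seq_len(nrow(df)), i))
names(folds_cv) <- held_out

# Number of training rows in each fold (all rows but one).
fold_sizes <- lengths(folds_cv)

# Recover the single row each fold leaves out.
omitted <- vapply(folds_cv,
                  function(idx) setdiff(seq_len(nrow(df)), idx),
                  integer(1))

print(length(folds_cv))           # number of folds
print(unique(fold_sizes))         # training rows per fold
print(range(df$Time[omitted]))    # Time values of the held-out rows
```

If the folds are built correctly, every held-out row should have Time above 15 and no row should be held out twice.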

I am not sure why Rsquared is missing; I will dig a bit deeper.

edited Dec 18 '18 at 07:35

answered Nov 30 '18 at 08:11


Hey @missuse, I have just gotten around to running this again and it looks like there is a slight issue with the code: it produces a list of single integers to be used as test sets rather than training sets. Is there a way to invert it? I think `trainControl` needs the training sets specified rather than the test sets. Sorry for the trouble and thanks again for the help! – André.B Dec 17 '18 at 21:59

You can specify the test indexes in `trainControl` using the argument `indexOut`. All others will be used for training. As specified in my answer: "#clean up so it works with indexOut argument in trainControl" – missuse Dec 17 '18 at 22:02

I gave that a try as suggested, but I am getting an error with the test data: `test_control <- trainControl(indexOut = folds_cv, method = "cv")` and then `quad.lm2 <- train(Time ~ Effect, data = df, method = "lm", trControl = test_control)`. Any idea what I am doing wrong, @missuse? – André.B Dec 17 '18 at 23:42
You are correct. It appears not to be working although I am sure I have used it in some previous caret version successfully. I have edited the answer with a working example. Still there is a minor problem in `R2` which I will try to get to. – missuse Dec 18 '18 at 07:36
I suspect that it might be returning NaNs for the R^2 because one can't tell how well one variable is correlated with another if one only has a single point to draw upon. What do you think? – André.B Dec 18 '18 at 22:39
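
The suspicion in the last comment can be illustrated in base R. This is a sketch under stated assumptions, not a trace of caret's internals: it assumes the per-resample R^2 is a squared correlation between predictions and observations, and that the per-resample values are averaged with missing values dropped, which would be consistent with the NaN in the output above. All numbers and variable names below are made up for illustration.

```r
# Hypothetical single held-out observation and its prediction.
obs_one  <- 16
pred_one <- 16

# With one point there is no variance, so the correlation is undefined.
r_one <- suppressWarnings(cor(pred_one, obs_one))
print(r_one)

# Assumption: per-resample values are averaged with missing values dropped.
# Fifty missing R^2 values then leave nothing to average.
r2_mean <- mean(rep(r_one^2, 50), na.rm = TRUE)
print(r2_mean)

# Pooling the held-out predictions before computing R^2 avoids the problem.
# These values are invented; they are not the model's predictions.
obs_all  <- rep(16:20, each = 10)
pred_all <- obs_all + rep(c(-0.1, 0.1), times = 25)
r2_pooled <- cor(pred_all, obs_all)^2
print(r2_pooled)
```

With the fitted object from the answer, a pooled figure of this kind could be computed as `cor(quad.lm2$pred$pred, quad.lm2$pred$obs)^2`, given that `savePredictions = "final"` was set.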
