This question builds on one I asked previously: "Creating data partitions over a selected range of data to be fed into caret::train function for cross-validation".

The data I am working with looks like this:

df <- data.frame(Effect = rep(seq(from = 0.05, to = 1, by = 0.05), each = 5), Time = rep(c(1:20,1:20), each = 5), Replicate = c(1:5))

Essentially, I would like to create custom partitions like those generated by the caret::groupKFold function, but restricted to a specified range (i.e. > 15 days), with each fold withholding a single point as the test set and using all other data for training. This would be repeated until every point in the specified range has been used as a test set. In the question linked above, @Missuse wrote some code that gets close to the desired output.

I would try to show the desired output, but in all honesty the output of caret::groupKFold confuses me, so hopefully the above description will suffice. Happy to try to clarify, though!

André.B
  • You can proceed as in the linked answer, but instead of splitting by time, split by a dummy variable that is an integer sequence `1:n()`. If you still have problems, I can post an answer with code. – missuse Nov 22 '18 at 07:09
  • I am not exactly sure how to implement this, and I think I may have been a little misleading about how the data was represented... I have just updated the question to have a more representative dataset. Sorry for any trouble this might have caused and thank you again for the help! – André.B Nov 30 '18 at 03:00

1 Answer

Here is one way you could create the desired partition using tidyverse:

library(tidyverse)
library(caret) # provides trainControl() and train(), used further down

df %>%
  mutate(id = row_number()) %>% #create a column called id which will hold the row numbers
  filter(Time > 15) %>% #subset data frame according to your description 
  split(.$id)  %>% #split the data frame into lists by id (row number)
  map(~ .x %>% select(id) %>% #clean up so it works with indexOut argument in trainControl
        unlist %>%
        unname) -> folds_cv
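
For readers who would rather avoid the tidyverse dependency, the same held-out list can be sketched in base R. This is an illustration, not part of the original answer: `held_out` and `folds_test` are names introduced here, and `df` is rebuilt from the question so the snippet runs on its own.

```r
# Rebuild the example data from the question (200 rows).
df <- data.frame(Effect = rep(seq(from = 0.05, to = 1, by = 0.05), each = 5),
                 Time = rep(c(1:20, 1:20), each = 5),
                 Replicate = c(1:5))

# Row numbers of every observation in the range of interest (Time > 15).
held_out <- which(df$Time > 15)

# One list element per held-out row, named by its row number,
# mirroring the structure produced by split() + map() above.
folds_test <- setNames(as.list(held_out), held_out)

length(folds_test)        # number of folds, one per held-out row
str(head(folds_test, 3))  # each element is a single row number
```

Each element is a single row number, so at this stage the list describes test rows, not training rows.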

EDIT: the indexOut argument does not seem to behave as expected, but the index argument does. Since index takes the training rows, after creating folds_cv you can invert each fold with setdiff:

folds_cv <- lapply(folds_cv, function(x) setdiff(1:nrow(df), x))
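
Because a wrong inversion is easy to miss, it may be worth checking the training indices before handing them to `trainControl`. The sketch below uses base R only and rebuilds `df` and the held-out rows itself; `held_out`, `train_idx` and `left_out` are names introduced here, with `train_idx` standing in for the inverted `folds_cv`.

```r
# Rebuild the example data from the question (200 rows).
df <- data.frame(Effect = rep(seq(from = 0.05, to = 1, by = 0.05), each = 5),
                 Time = rep(c(1:20, 1:20), each = 5),
                 Replicate = c(1:5))

# Rows that should each be held out exactly once.
held_out <- which(df$Time > 15)

# Training rows for each fold: everything except the single held-out row.
train_idx <- lapply(held_out, function(i) setdiff(seq_len(nrow(df)), i))

# Every fold should train on nrow(df) - 1 rows ...
stopifnot(all(lengths(train_idx) == nrow(df) - 1))

# ... and the row missing from each fold should be a distinct row with Time > 15.
left_out <- vapply(train_idx,
                   function(idx) setdiff(seq_len(nrow(df)), idx),
                   integer(1))
stopifnot(!anyDuplicated(left_out), all(df$Time[left_out] > 15))
```

If either `stopifnot` call fails, the folds do not describe the intended leave-one-out scheme over the selected range.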

and now:

test_control <- trainControl(index = folds_cv,
                             savePredictions = "final")


quad.lm2 <- train(Time ~ Effect,
                  data = df,
                  method = "lm",
                  trControl = test_control)

with a warning:

Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.
> quad.lm2
Linear Regression 

200 samples
  1 predictor

No pre-processing
Resampling: Bootstrapped (50 reps) 
Summary of sample sizes: 199, 199, 199, 199, 199, 199, ... 
Resampling results:

  RMSE          Rsquared  MAE         
  3.552714e-16  NaN       3.552714e-16

Tuning parameter 'intercept' was held constant at a value of TRUE

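So each resample trained on 199 rows and predicted the single remaining row, once for each of the 50 rows we wanted to hold out. (The printout says "Bootstrapped" only because the method argument of trainControl defaults to "boot"; the splits themselves come from index.) This can be verified in: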

quad.lm2$pred

I am not sure why Rsquared is missing; I will dig a bit deeper.

missuse
  • Hey @missuse, I have just gotten around to running this again and it looks like there is a slight issue with the code: the above produces a list of single integers to be used as test sets rather than training sets. Is there a way to invert it? I think `trainControl` needs the training sets specified rather than the test sets. Sorry for the trouble and thanks again for the help! – André.B Dec 17 '18 at 21:59
  • You can specify the test indexes in `trainControl` using the argument `indexOut`. All others will be used for training. As specified in my answer: "#clean up so it works with indexOut argument in trainControl" – missuse Dec 17 '18 at 22:02
  • I gave that a try as suggested but I am getting this error with the test data: `test_control <- trainControl(indexOut = folds_cv, method = "cv")` and then `quad.lm2 <- train(Time ~ Effect, data = df, method = "lm", trControl = test_control) ` Any idea what I am doing wrong @missuse? – André.B Dec 17 '18 at 23:42
  • You are correct. It appears not to be working although I am sure I have used it in some previous caret version successfully. I have edited the answer with a working example. Still there is a minor problem in `R2` which I will try to get to. – missuse Dec 18 '18 at 07:36
  • I suspect that it might be producing NaNs for the R^2 because one can't tell how well one variable is correlated with another when there is only a single point to draw upon. What do you think? – André.B Dec 18 '18 at 22:39
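
The single-point explanation can be illustrated without caret. The sketch below assumes (this is not confirmed anywhere in the thread) that Rsquared is computed as the squared correlation between observed and predicted values within each resample, and that the per-resample values are then averaged with missing values dropped. Check both assumptions against your caret version. The values of `obs` and `pred` are made up for illustration.

```r
# One held-out observation and its prediction (illustrative values).
obs  <- 16
pred <- 16

# A correlation needs at least two points, so with one point it is NA,
# and so is its square.
r2_single <- suppressWarnings(cor(obs, pred))^2
is.na(r2_single)

# Averaging 50 such values with NAs dropped leaves nothing to average,
# and the mean of an empty numeric vector is NaN.
r2_all <- rep(r2_single, 50)
mean(r2_all, na.rm = TRUE)
```

Under those assumptions, RMSE and MAE remain well defined for a single held-out point, while Rsquared cannot be computed for any fold, which would be consistent with the NaN in the printed summary and with the warning about missing values in resampled performance measures.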