My question is very similar to the one asked in "caret: combine createResample and groupKFold".
The only difference: after grouping, I need stratified folds (10-fold CV repeated 10 times) rather than bootstrapped resamples (which are not stratified, as far as I know), for use with caret's trainControl.
The following code works for 10-fold CV repeated 10 times, but I couldn't work out how to include the grouping of the data based on an "ID" (df$ID).
library(caret)
# create the training indices for 10-fold CV repeated 10 times (stratified on rf_label)
cv.10.folds <- createMultiFolds(rf_label, k = 10, times = 10)
# pass the indices to the resampling control object
ctrl.10fold <- trainControl(method = "repeatedcv", number = 10, repeats = 10, index = cv.10.folds)
# train the random forest
rf.ctrl10 <- train(rf_train, y = rf_label, method = "rf", tuneLength = 6,
                   ntree = 1000, trControl = ctrl.10fold, importance = TRUE)
Here is my actual problem: my data contains many groups of 20 instances each, and all instances in a group share the same "ID". With the 10-fold CV repeated 10 times, some instances of a group end up in the training set and others in the validation set. I want to avoid this, but I still need the partitioning to be stratified on the prediction value (df$Label). (All instances with the same "ID" also have the same prediction/label value.)
In the accepted answer to the question linked above (relevant part below), I guess I have to modify the folds2 line so that it produces stratified 10-fold CV instead of bootstrapped resamples:
folds <- groupKFold(x)
folds2 <- lapply(folds, function(x) lapply(1:10, function(i) sample(x, size = length(x), replace = TRUE)))
Unfortunately, I cannot figure out how exactly. Could you help me with that?
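One direction I have been considering, as a rough sketch only: since every ID carries exactly one label, the stratification could be done on the groups rather than on the rows, and the group folds could then be mapped back to row indices. The toy data frame and the names groups and group.folds below are hypothetical and only there to make the example self-contained; createMultiFolds and trainControl are the same caret functions as above.

```r
library(caret)

set.seed(42)

# Hypothetical toy data: 40 groups of 20 rows each, one label per group
df <- data.frame(
  ID    = rep(seq_len(40), each = 20),
  Label = factor(rep(rep(c("a", "b"), times = c(25, 15)), each = 20))
)

# One row per group; this relies on all rows of an ID sharing the same label
groups <- unique(df[, c("ID", "Label")])

# Stratified 10-fold CV repeated 10 times, computed over groups, not rows.
# Each list element holds the positions (in 'groups') of the training groups.
group.folds <- createMultiFolds(groups$Label, k = 10, times = 10)

# Map the training groups back to the row indices of df
cv.10.folds <- lapply(group.folds, function(g) which(df$ID %in% groups$ID[g]))

# Same control object as before, now with group-aware training indices
ctrl.10fold <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                            index = cv.10.folds)
```

With this, each element of cv.10.folds should contain only complete groups, so no ID is split between training and validation. The stratification is only as good as the number of groups per label allows, and I am not sure whether this is the intended way to do it with caret.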