My question is very similar to the one asked in caret: combine createResample and groupKFold

The only difference: instead of bootstrapped resamples (which are not stratified, as far as I know), I need to create stratified folds (also repeated 10 times) after grouping, for use with caret's trainControl. The following code works for 10-fold repeated CV, but I have not managed to include the grouping of the data based on an "ID" (df$ID).

# create the resampling indices
cv.10.folds <- createMultiFolds(rf_label, k = 10, times = 10)
# set up the resampling control
ctrl.10fold <- trainControl(method = "repeatedcv", number = 10, repeats = 10, index = cv.10.folds)
# train
rf.ctrl10 <- train(rf_train, y = rf_label, method = "rf", tuneLength = 6,
                   ntree = 1000, trControl = ctrl.10fold, importance = TRUE)

Here is my actual problem: my data contains many groups of 20 instances each, all sharing the same "ID". With 10-fold CV repeated 10 times, some instances of a group end up in the training set and some in the validation set. I want to avoid this, but overall I still need a partitioning that is stratified on the prediction value (df$Label). (All instances with the same "ID" also have the same prediction/label value.)

In the accepted answer from the link above (see the parts below), I guess I have to modify the folds2 line so that it produces stratified 10-fold CV instead of bootstrapped resamples:

folds <- groupKFold(x)
folds2 <- lapply(folds, function(x) lapply(1:10, function(i) sample(x, size = length(x), replace = TRUE)))

but unfortunately I cannot figure out how exactly. Could you help me with that?

missuse
snoopy
  • If I understand correctly, you need stratified k-fold CV with blocking? Could you provide a toy example of the labels? – missuse Jan 12 '18 at 06:36
  • Yes, exactly. My labels are numeric (floats) - I'm using random forest in regression mode. – snoopy Jan 12 '18 at 08:17
  • stratified repeated k-fold CV with blocking (preferentially compatible with caret's `train` function) – snoopy Jan 12 '18 at 11:07
  • I am having trouble providing a solution that includes stratification and blocking at the same time. Stratification of numeric variables can be done using percentiles; however, when blocking is involved, the best I could think of was to compute a summary statistic for each blocking group and stratify on that. Would that be ok? – missuse Jan 12 '18 at 12:30
  • Yes, that would be great! (So far, I've been using the built-in percentile functionality.) In my case the stratification doesn't need to be 100% accurate, a rough one would suit very well. – snoopy Jan 12 '18 at 14:50

1 Answer

Here is an approach to perform stratified repeated K-fold CV with blocking.

library(caret)
library(tidyverse)

Some fake data, where id will be the blocking factor:

id <- sample(1:55, size = 1000, replace = TRUE)
y <- rnorm(1000)
x <- matrix(rnorm(10000), ncol = 10)
df <- data.frame(id, y, x)

Summarise the observations by the blocking factor:

df %>%
  group_by(id) %>%
  summarise(mean = mean(y)) %>%
  ungroup() -> groups1 

Create the stratified folds based on the grouped data:

folds <- createMultiFolds(groups1$mean, 10, 3)

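Join the original df back onto the group data and take the df row ids. Note that this treats each fold index as a group id, which works here because the ids are exactly 1 to 55 and groups1 is sorted by id:

folds <- lapply(folds, function(i){
  data.frame(id = i) %>%
    left_join(df %>%
                rowid_to_column(), by = "id") %>%
    pull(rowid) 
})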

Check that the ids in the test folds are not in the train folds:

lapply(folds, function(i){
  sum(df[i,1] %in% df[-i,1])
})

The output is all zeros, meaning that no ids in the test folds are in the train folds.
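The same check can be wrapped in a small function that returns one count per resample. This is only a sketch in base R: `check_leak`, the toy `ids` vector and the two hand-made folds are hypothetical names and data used for illustration, with one deliberately leaky fold so that a non-zero count is visible.

```r
# Hypothetical helper: number of ids shared between the training rows
# and the held-out rows, for each resample
check_leak <- function(folds, ids) {
  vapply(folds,
         function(i) length(intersect(ids[i], ids[-i])),
         integer(1))
}

# Toy data (assumption): 6 groups of 4 rows each
ids <- rep(1:6, each = 4)

toy_folds <- list(
  good = which(ids %in% 1:4),  # training rows made of whole groups only
  bad  = 1:14                  # cuts group 4 in half, so it leaks
)

leaks <- check_leak(toy_folds, ids)
print(leaks)
```

A value of 0 for every element means the blocking held; any positive value is the number of groups split across train and test in that resample.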

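If your group ids are not numeric, there are two approaches to make this work:
1. Convert them to numeric:

First, some data:

id <- sample(1:55, size = 1000, replace = TRUE)
y <- rnorm(1000)
x <- matrix(rnorm(10000), ncol = 10)
# make the ids an explicit factor: since R 4.0 data.frame() keeps strings as
# character, and as.numeric() on character ids like "id_1" would return NA
df <- data.frame(id = factor(paste0("id_", id)), y, x)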

df %>%
  mutate(id = as.numeric(id)) %>% #convert to numeric
  group_by(id) %>%
  summarise(mean = mean(y)) %>%
  ungroup() -> groups1 

folds <- createMultiFolds(groups1$mean, 10, 3)

folds <- lapply(folds, function(i){
  data.frame(id = i) %>%
    left_join(df %>%
                mutate(id = as.numeric(id)) %>% #also need to convert to numeric in the original data frame
                rowid_to_column()) %>%
    pull(rowid) 
})  

2. Filter the ids in the grouped data according to the fold indexes and then join by id:

df %>%
  group_by(id) %>%
  summarise(mean = mean(y)) %>%
  ungroup() -> groups1 

folds <- createMultiFolds(groups1$mean, 10, 3)

folds <- lapply(folds, function(i){
  groups1 %>% #start from grouped data
    select(id) %>% #select ids
    slice(i) %>% #keep the ids at the fold index positions
    left_join(df %>% #join by id 
               rowid_to_column(), by = "id") %>%
    pull(rowid) 
})
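The second approach can also be written as one reusable function without the join. The following is a sketch rather than a definitive implementation: `grouped_multi_folds` is a hypothetical name, it uses the group mean as the stratification variable (as above), and the toy `ids` and `y` are made up for illustration.

```r
library(caret)

# Hypothetical helper: stratified, repeated k-fold training indices
# that keep all rows of a group on the same side of the split
grouped_multi_folds <- function(ids, y, k = 10, times = 3) {
  ids <- as.character(ids)            # works for numeric, character or factor ids
  grp_mean <- tapply(y, ids, mean)    # one summary value per group, named by id
  # stratify on the group summaries; indices refer to positions in grp_mean
  grp_folds <- createMultiFolds(as.numeric(grp_mean), k = k, times = times)
  # translate group positions into row numbers of the original data
  lapply(grp_folds, function(i) which(ids %in% names(grp_mean)[i]))
}

# Toy data (assumption): 55 character ids, numeric response
set.seed(42)
ids <- paste0("id_", sample(1:55, size = 1000, replace = TRUE))
y <- rnorm(1000)

folds <- grouped_multi_folds(ids, y)

# number of resamples with an id on both sides of the split (should be 0)
n_leaky <- sum(vapply(folds, function(i) any(ids[i] %in% ids[-i]), logical(1)))
```

The returned list has the same shape as the output of createMultiFolds, so it can be passed to the index argument of trainControl.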

Will it work for caret?

ctrl.10fold <- trainControl(method = "repeatedcv", number = 10, repeats = 3, index = folds)

# with the x/y interface the data argument is not needed
rf.ctrl10 <- train(x = df[,-c(1:2)], y = df$y, method = "rf", tuneLength = 1,
                   ntree = 20, trControl = ctrl.10fold, importance = TRUE)

rf.ctrl10$results
#output
  mtry     RMSE    Rsquared       MAE     RMSESD  RsquaredSD      MAESD
1    3 1.041641 0.007534611 0.8246514 0.06953668 0.009488169 0.05934975
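As far as I know, when only index is supplied, caret evaluates each resample on the rows that are not in the corresponding training index. If you prefer to make the held-out rows explicit, trainControl also accepts an indexOut list. A minimal sketch, assuming a toy `ids` vector and two hand-made group-wise splits (all names here are illustrative):

```r
library(caret)

# Toy data (assumption): 20 groups of 5 rows each
ids <- rep(1:20, each = 5)
n <- length(ids)

# training rows for two illustrative resamples, split by whole groups
index <- list(Fold1 = which(ids > 4),
              Fold2 = which(ids <= 16))

# held-out rows: everything that is not in the training index
index_out <- lapply(index, function(i) setdiff(seq_len(n), i))

ctrl <- trainControl(method = "cv", number = 2,
                     index = index, indexOut = index_out)
```

Here ctrl can be passed as trControl to train in the same way as ctrl.10fold above.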

I also suggest you check out the mlr package. It has many nice features, including blocking (here is one answer on SO), and very good tutorials on many topics. For a long time I thought you use either caret or mlr, but they complement each other very nicely.

missuse
  • Thank you very much for all the help and recommendations! I'll have a look at the `mlr` package. In your provided solution I'm having a hard time with one code part: 'back join the original df to the group data and take the df row id's'. The actual ids (which are character strings in my case) are not contained in `folds`, only in `groups1`, so the back joining doesn't work for me. I get the error: Can't join on 'id' x 'id' because of incompatible types (character / integer). So, I think I have to join `folds` and `groups1` first, and subsequently with the df, or did I miss something? – snoopy Jan 15 '18 at 11:47
  • @snoopy I added two approaches that can be used with character grouping id's. – missuse Jan 15 '18 at 13:14
  • @missuse have you tried time series cross validation and blocking together? Like keeping all items of various months together (month is the blocking) but keeping the time series nature? This would be useful ... I will try the blocking and maybe later I can ask here about the combination. – Richi W Aug 17 '18 at 14:11
  • @Richard if it is explainable it is doable. I am not quite sure I follow you exactly, you would like to predict the month ahead, for each month, based on previous data? Or you would like to predict each month based on data around it? Anyway this would require editing the answer beyond the scope of the current question by snoopy so I advise you to post another question. If you link it here in the comments I will be sure to check it. – missuse Aug 18 '18 at 11:44
  • @missuse thank you for your comment. Right, what I mean is rather using only data of the past (before that month) but always using whole months (blocking). Whole months of the past for training and then one month for assessment. My problem is that I am not satisfied with what I can do in caret. The functions for keras are quite limited (can we just use one layer?). mlr does not even support keras. So I am thinking about trying the package rsample: first blocking only (which should be group_vfold_cv), and maybe looking at time series CV without blocking using rolling_origin. – Richi W Aug 18 '18 at 18:38
  • @Richard I am not sure if keras supports cross validation out of the box (correct me if I am wrong), but you can construct a manual function where for each cv fold a train and validation set is specified. Perhaps in keras this can be performed using generator functions. I am not that confident in my generator function skills so I would probably opt for a wrapper function that will provide the train and test indexes. When anything a bit more complicated is needed in this regard I tend to construct my own indexes for training and testing utilizing dplyr and friends. – missuse Aug 18 '18 at 20:17