
I am using the caret and SuperLearner packages in R for repeated k-fold cross-validation on a survey dataset. To keep it simple, the dataset comprises an outcome variable and two features/predictors called feature1 and feature2. Crucially, the data consist of different survey waves, across which I want to predict. I then want to check how strongly the composition of my training data affects my model's performance.

To this end, I want to create a range of folds that vary the proportion of observations taken from the different groups, ranging from extremely balanced (all groups evenly represented) to extremely unbalanced (the training data consist of one group and the test data of another). This is easy enough to do if there are only two groups / survey waves (see my code below).

# Packages
library(tidyverse)
library(caret)
library(SuperLearner)

# Data with only two groups
df <- tibble(id = 1:1000,
             outcome = rnorm(1000),
             feature1 = rnorm(1000),
             feature2 = rnorm(1000),
             group = rep(1:2, each = 500) %>% as.character)

# Generate groups: with two groups and k = 2, groupKFold() holds out one
# group per fold, so each Fold* contains exactly the rows of one group
groups <- groupKFold(df$group, k = length(unique(df$group)))

# Generate folds: i is the share drawn from group 1 and (1.6 - i) the
# share drawn from group 2, with 100 random draws per mixing ratio
folds <- list()
for (i in seq(0.6, 1.0, by = 0.01)) {
  for (j in 1:100) {
    folds[[paste0("Fold", i, "_", j)]] <-
      c(sample(groups$Fold1, size = round(i * length(groups$Fold1))),
        sample(groups$Fold2, size = round((1.6 - i) * length(groups$Fold2))))
  }
}

Yet how would I go about this if there are more than two groups, say six? See the example below (and the rough sketch after it for the direction I have been considering):

df <- tibble(id = 1:3000,
             outcome = rnorm(3000),
             feature1 = rnorm(3000),
             feature2 = rnorm(3000),
             group = as.character(rep(1:6, each = 500)))
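
One direction I have been considering (I am not sure it is sound) is to pick one focal group, vary the share of training observations drawn from it, and split the remainder evenly across the other five groups. A rough sketch, building the index lists with base split() instead of groupKFold() (which, with six groups, would put five groups into each fold); the 0.05 step, the 100 repetitions, and group "1" as the focal group are arbitrary placeholders:

# Rows of each group, keyed by group label
group_rows <- split(seq_len(nrow(df)), df$group)

folds6 <- list()
for (p in seq(1/6, 1, by = 0.05)) {  # share drawn from the focal group
  for (j in 1:100) {                 # 100 random draws per mixing ratio
    focal <- sample(group_rows[["1"]], size = round(p * length(group_rows[["1"]])))
    rest  <- unlist(lapply(group_rows[-1], function(r)
      sample(r, size = round((1 - p) / 5 * length(r)))))
    folds6[[paste0("Fold", round(p, 2), "_", j)]] <- c(focal, rest)
  }
}

At p = 1/6 all six groups are represented evenly; at p close to 1 the training data consist almost entirely of the focal group. Is this a reasonable generalization of the two-group setup above?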

Also, while caret works just fine for my purpose, I have trouble using SuperLearner for the actual training and testing: caret yields detailed performance measures for each fold, which are easy to trace back to the proportion of observations from the two groups, whereas SuperLearner only reports overall performance measures for the different learners (see the sketch at the end of this post for the workaround I have in mind).

# CROSS-VALIDATION IN CARET
# The custom folds are passed as training-row indices via `index`
train.control <- trainControl(method = "repeatedcv", index = folds)

model <- train(outcome ~ .,
               data = df %>% select(-c(group, id)),
               method = "lm",
               trControl = train.control)

perf <- model$resample  # per-fold RMSE, Rsquared, MAE

# Proportion of group-1 observations in each fold's training rows
u <- sapply(folds, function(f) mean(df$group[f] == "1"))

u

perf <- perf %>% mutate(prop = u, diff = abs(0.5 - prop))

perf

ggplot(perf, aes(diff, Rsquared)) + 
  labs(y = "R²", x = "Deviation from balanced wave composition") +
  scale_x_continuous(labels = scales::percent) +
  geom_smooth(method = "lm") +
  geom_point(alpha = .75) + 
  ggpubr::stat_cor() +
  theme_minimal()
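
For reference, here is the per-fold workaround I currently have in mind for SuperLearner: fit it once per custom fold and score the held-out rows myself, so that I get one performance value per fold, analogous to model$resample. The learner library (SL.lm, SL.glmnet) is just a placeholder, and I treat each fold's indices as training rows whose complement forms the test set:

# One SuperLearner fit per custom fold, with per-fold R² computed manually
X <- df %>% select(feature1, feature2) %>% as.data.frame()
y <- df$outcome

sl_perf <- map_dfr(names(folds), function(nm) {
  train_idx <- folds[[nm]]
  test_idx  <- setdiff(seq_len(nrow(df)), train_idx)
  fit <- SuperLearner(Y = y[train_idx], X = X[train_idx, ],
                      newX = X[test_idx, ],                 # predict the held-out rows
                      SL.library = c("SL.lm", "SL.glmnet")) # placeholder learners
  pred <- as.numeric(fit$SL.predict)
  tibble(fold = nm,
         Rsquared = 1 - sum((y[test_idx] - pred)^2) /
           sum((y[test_idx] - mean(y[test_idx]))^2))
})

The prop / diff columns could then be joined onto sl_perf by fold name, as with perf above. But this feels clumsy — is there a way to get such a per-fold breakdown directly out of CV.SuperLearner?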
