library(SuperLearner)
library(MASS)
set.seed(23432)
## training set
n <- 500
p <- 50
X <- matrix(rnorm(n*p), nrow = n, ncol = p)
colnames(X) <- paste("X", 1:p, sep="")
X <- data.frame(X)
Y <- X[, 1] + sqrt(abs(X[, 2] * X[, 3])) + X[, 2] - X[, 3] + rnorm(n)


sl_cv = SuperLearner(Y = Y, X = X, family = gaussian(),
                  SL.library = c("SL.mean", "SL.ranger"),
                  verbose = TRUE, cvControl = list(V = 5))

In the above code, I'm performing a 5-fold CV to train a SuperLearner. However, what if I want to create my own folds in the data manually? I'm interested in trying this because I know there are clusters in my data, and I would like to perform CV on the folds that I've created.

Suppose the five folds for my toy data are split1, ..., split5, as below. Is there a way to cross-validate on these five folds instead of letting SuperLearner split the data itself?

set.seed(1)
index <- sample(1:5, size = nrow(X), replace = TRUE, prob = c(0.2, 0.2, 0.2, 0.2, 0.2)) 
split1 <- X[index == 1, ]
split2 <- X[index == 2, ]
split3 <- X[index == 3, ]
split4 <- X[index == 4, ]
split5 <- X[index == 5, ]
split1.y <- Y[index == 1]
split2.y <- Y[index == 2]
split3.y <- Y[index == 3]
split4.y <- Y[index == 4]
split5.y <- Y[index == 5]
Adrian
  • I've never used this library, but the behavior of `cvControl` looks hard-coded into the source: https://www.rdocumentation.org/packages/SuperLearner/versions/2.0-26/source. – shadowtalker Jun 25 '20 at 15:35
  • @shadowtalker https://www.rdocumentation.org/packages/SuperLearner/versions/2.0-26/topics/SuperLearner.CV.control would using `validRows` work? – Adrian Jun 25 '20 at 16:11
  • For splitting I'd suggest using `split(X, index)` and `split(Y, index)`. Each returns a list `out` with `length(out) == length(unique(index))`, which is more readable and easier to use. From the function you linked to, it seems `n <- NROW(X); cvControl = list(validRows = split(seq_len(n), index), V = length(unique(index)))` might be the answer to your problem. – Oliver Jun 26 '20 at 21:54
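
The `split()` idea from the last comment can be sketched in base R as follows. This is a minimal sketch that only builds and checks the fold list without calling SuperLearner; the names `valid_rows` and `cv_control` are illustrative, and `unname()` is a precaution I've added, not something the comment asks for.

```r
# Fold labels as in the question: one label in 1..5 per row
set.seed(1)
n <- 500
index <- sample(1:5, size = n, replace = TRUE)

# One integer vector of row numbers per fold, in label order
valid_rows <- unname(split(seq_len(n), index))

# Every row should appear in exactly one fold
stopifnot(length(valid_rows) == length(unique(index)))
stopifnot(identical(sort(unlist(valid_rows)), seq_len(n)))

# Candidate control list, following the comment's suggestion
cv_control <- list(V = length(valid_rows), validRows = valid_rows)
```

The resulting `cv_control` would then be passed as the `cvControl` argument.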

2 Answers


The cross-validation procedure has a few control parameters, and validRows is the one you want. It takes a list with 5 elements, each a vector of the row numbers belonging to one of your predefined clusters. Assuming df holds your predictors plus a column named cluster that records which cluster each observation belongs to, you could write something like:

# one vector of row numbers per cluster (assumes clusters are labelled 1..5)
L = lapply(1:5, function(k) which(df$cluster == k))
# drop the cluster column by name; df[-c("cluster")] is an error in R
X = df[, setdiff(names(df), "cluster")]
sl_cv = SuperLearner(Y = Y, X = X, family = gaussian(),
              SL.library = c("SL.mean", "SL.ranger"),
              verbose = TRUE, cvControl = list(V = 5, validRows=L))

Hope this helps!
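
For cluster labels that aren't the integers 1 to 5, the same construction might look like the sketch below. The data frame `df`, its column names, and the labels "a" to "e" are all made up for illustration, and the final check is just a sanity test that the folds are disjoint and cover every row.

```r
# Hypothetical data: 3 predictors plus a `cluster` column with 5 labels
set.seed(42)
df <- data.frame(X1 = rnorm(100), X2 = rnorm(100), X3 = rnorm(100),
                 cluster = sample(c("a", "b", "c", "d", "e"), 100, replace = TRUE))

# Row numbers for each cluster, one list element per cluster
cluster_ids <- sort(unique(df$cluster))
L <- lapply(cluster_ids, function(cl) which(df$cluster == cl))

# Predictors only: drop the cluster column by name
X <- df[, setdiff(names(df), "cluster"), drop = FALSE]

# Folds must be disjoint and together cover all rows
stopifnot(!anyDuplicated(unlist(L)), length(unlist(L)) == nrow(df))
```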

J Frans

Here is a full solution, repeating the data preparation from the question. The last lines verify that the training data for each fold exclude that fold's validation rows.

library(SuperLearner)
library(MASS)
set.seed(23432)
## training set
n <- 500
p <- 50
X <- matrix(rnorm(n*p), nrow = n, ncol = p)
colnames(X) <- paste("X", 1:p, sep="")
X <- data.frame(X)
Y <- X[, 1] + sqrt(abs(X[, 2] * X[, 3])) + X[, 2] - X[, 3] + rnorm(n)

set.seed(1)
index <- sample(1:5, size = nrow(X), replace = TRUE, prob = c(0.2, 0.2, 0.2, 0.2, 0.2)) 

validRows=list()
for (v in 1:5)
  validRows[[v]] <- which(index==v)

sl_cv = SuperLearner(Y = Y, X = X, family = gaussian(),
                     SL.library = c("SL.mean", "SL.ranger"),
                     verbose = TRUE,
                     control = SuperLearner.control(saveCVFitLibrary = TRUE),
                     cvControl = list(V = 5, shuffle = FALSE, validRows = validRows))

# training-set size per fold, derived from the lengths of the declared validRows
n - sapply(sl_cv$validRows, length)

# training-set size per fold, derived from the fitted ranger models
sapply(1:5, function(i) length(sl_cv$cvFitLibrary[[i]]$SL.ranger_All$object$predictions))
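
What the two checks above should agree on can be spelled out without fitting anything. The sketch below rebuilds the folds and derives the training rows for each one; `trainRows` and `train_sizes` are illustrative names, not part of the SuperLearner output.

```r
# Rebuild the folds exactly as above
set.seed(1)
n <- 500
index <- sample(1:5, size = n, replace = TRUE)
validRows <- lapply(1:5, function(v) which(index == v))

# Training rows for a fold are all rows not held out in that fold
trainRows <- lapply(validRows, function(valid) setdiff(seq_len(n), valid))

# Expected training-set size per fold
train_sizes <- lengths(trainRows)
stopifnot(all(train_sizes == n - lengths(validRows)))
```

If the custom folds were honoured, both `sapply` calls in the answer should print these same five numbers.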
Antoni