I'm looking to combine learners, each developed using a different subset of features and a different algorithm, into a SuperLearner. I realize this is not how Super Learning generally works, but please trust that I have my reasons.

I've been creating custom SL.___ functions and treating the subsets of features like hyperparameters, but as you'll see below, this creates argument conflicts when I try to call them within CV.SuperLearner.

Any suggestions? Is there an easier way to do this in the sl3 package?

library(tidyverse)

library(SuperLearner)
#> Loading required package: nnls
#> Super Learner
#> Version: 2.0-26
#> Package created on 2019-10-27

set.seed(123)

Example data set

data<- data.frame(
  id = 1:600, 
  a = sample(1:1000, size = 600, replace = TRUE), 
  b = rbinom(600, 1, .8), 
  c = rbinom(600, 100, .3), 
  d = sample(c(1:5), 600, replace = TRUE), 
  e = rpois(600, 4), 
  y = rnorm(600, 70, sd=15)
)

Creating a data frame containing only features/variables by dropping the ID column and the outcome "y"

data_x<-data %>%
  select(-c("id", "y"))

Creating 3 learners that each use a different algorithm/approach and a different subset of features. The first learner uses glm and features a, b, and c; the second uses LASSO regression (glmnet with alpha = 1) and features b, d, and e; and the third uses a random forest (ranger, default hyperparameters) and features a, c, d, and e.

L1 = function(...) {
  SL.glm(..., X=L1_data)
}

L1_data<-data %>%
  select("a", "b", "c")

L2 = function(...) {
  SL.glmnet(..., X=L2_data, alpha = 1)
}

L2_data<-data %>%
  select("b", "d", "e")


L3<-function(...) {
  SL.ranger(..., X=L3_data)  
}

L3_data<-data %>%
  select("a", "c", "d", "e")

Not surprisingly, failing to define the "X" argument in the CV.SuperLearner command generates an error.

cv.SL_1 = CV.SuperLearner(Y=data$y, family = gaussian(), 
                         V=10, 
                         SL.library = c("L1", "L2", "L3"))

#> Error in CV.SuperLearner(Y = data$y, family = gaussian(), V = 10, SL.library = c("L1", : argument "X" is missing, with no default

But defining X within the CV.SuperLearner command generates errors too since now X has been defined twice. (I deleted most of the repetitive warnings and errors for everyone's sanity.)

cv.SL_2 = CV.SuperLearner(Y=data$y, X=data_x, family = gaussian(), 
                        V=10, 
                        SL.library = c("L1", "L2", "L3"))

#> Error in SL.glm(..., X = L1_data) : 
#>   formal argument "X" matched by multiple actual arguments
#> Warning in FUN(X[[i]], ...): Error in algorithm L1 
#>   The Algorithm will be removed from the Super Learner (i.e. given weight 0)
#> Error in SL.glmnet(..., X = L2_data, alpha = 1) : 
#>   formal argument "X" matched by multiple actual arguments
#> Warning in FUN(X[[i]], ...): Error in algorithm L2 
#>   The Algorithm will be removed from the Super Learner (i.e. given weight 0)
#> Error in SL.ranger(..., X = L3_data) : 
#>   formal argument "X" matched by multiple actual arguments
#> Warning in FUN(X[[i]], ...): Error in algorithm L3 

Created on 2020-11-09 by the reprex package (v0.3.0)


1 Answer

Never mind, I figured out how to write a proper screening algorithm within the SuperLearner package. Short example below.

# Screening function: SuperLearner passes X (among other named arguments)
# and expects back a logical vector marking which columns to keep.
L1 <- function(X, ...){
  returnCols <- rep(FALSE, ncol(X))
  returnCols[names(X) %in% c("a", "b", "c")] <- TRUE
  return(returnCols)
}

cv.SL_1 = CV.SuperLearner(Y = data$y, X = data_x, family = gaussian(),
                          V = 10,
                          SL.library = list(c("SL.glm", "L1"),
                                            c("SL.glm", "All"))
)
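
The same pattern extends to the other two learners from the original post. Untested sketch (the screener names screen.L2 and screen.L3 are just placeholders I made up; SL.glmnet and SL.ranger need the glmnet and ranger packages installed):

# Screener for the LASSO learner: keep only b, d, and e
screen.L2 <- function(X, ...) {
  names(X) %in% c("b", "d", "e")
}

# Screener for the random forest learner: keep only a, c, d, and e
screen.L3 <- function(X, ...) {
  names(X) %in% c("a", "c", "d", "e")
}

cv.SL_3 = CV.SuperLearner(Y = data$y, X = data_x, family = gaussian(),
                          V = 10,
                          SL.library = list(c("SL.glm", "L1"),
                                            c("SL.glmnet", "screen.L2"),
                                            c("SL.ranger", "screen.L3")))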
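
As for the sl3 part of the question: I haven't verified this, but my understanding from the sl3 documentation is that every learner accepts a covariates argument restricting it to a subset of the task's columns, which would avoid screening wrappers entirely. Rough, unverified sketch:

# Unverified sketch of the sl3 route: each learner is given only its own
# subset of covariates via the `covariates` argument (per my reading of
# the sl3 docs), then the learners are combined with Lrnr_sl.
library(sl3)

task <- make_sl3_Task(data = data,
                      covariates = c("a", "b", "c", "d", "e"),
                      outcome = "y")

lrn_glm    <- Lrnr_glm$new(covariates = c("a", "b", "c"))
lrn_lasso  <- Lrnr_glmnet$new(alpha = 1, covariates = c("b", "d", "e"))
lrn_ranger <- Lrnr_ranger$new(covariates = c("a", "c", "d", "e"))

sl <- Lrnr_sl$new(learners = list(lrn_glm, lrn_lasso, lrn_ranger),
                  metalearner = Lrnr_nnls$new())
fit <- sl$train(task)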