I'm looking to combine learners, each built with a different algorithm and a different subset of features, into a SuperLearner. I realize this is not how SuperLearning generally works, but please trust that I have my reasons.
I've been creating custom SL.___ functions and treating the subsets of features like hyper-parameters, but as you'll see below, this creates confusion when I try to call them within CV.SuperLearner.
Any suggestions? Is there an easier way to do this in the sl3 package?
library(tidyverse)
library(SuperLearner)
#> Loading required package: nnls
#> Super Learner
#> Version: 2.0-26
#> Package created on 2019-10-27
set.seed(123)
Example data set
data <- data.frame(
  id = 1:600,
  a = sample(1:1000, size = 600, replace = TRUE),
  b = rbinom(600, 1, .8),
  c = rbinom(600, 100, .3),
  d = sample(c(1:5), 600, replace = TRUE),
  e = rpois(600, 4),
  y = rnorm(600, 70, sd = 15)
)
Creating a data frame containing only the features by dropping the id column and the outcome y.
data_x <- data %>%
  select(-c("id", "y"))
Creating 3 learners that each use a different algorithm and a different subset of features. The first learner uses glm and features a, b, and c. The second learner uses LASSO regression and features b, d, and e. The third learner uses a random forest with all default hyperparameters and features a, c, d, and e.
L1 <- function(...) {
  SL.glm(..., X = L1_data)
}
L1_data <- data %>%
  select("a", "b", "c")
L2 <- function(...) {
  SL.glmnet(..., X = L2_data, alpha = 1)
}
L2_data <- data %>%
  select("b", "d", "e")
L3 <- function(...) {
  SL.ranger(..., X = L3_data)
}
L3_data <- data %>%
  select("a", "c", "d", "e")
Not surprisingly, failing to define the "X" argument in the CV.SuperLearner command generates an error.
cv.SL_1 <- CV.SuperLearner(Y = data$y, family = gaussian(),
                           V = 10,
                           SL.library = c("L1", "L2", "L3"))
#> Error in CV.SuperLearner(Y = data$y, family = gaussian(), V = 10, SL.library = c("L1", : argument "X" is missing, with no default
But defining X within the CV.SuperLearner command generates errors too, since X is now supplied twice: once by CV.SuperLearner and once inside each wrapper. (I deleted most of the repetitive warnings and errors for everyone's sanity.)
cv.SL_2 <- CV.SuperLearner(Y = data$y, X = data_x, family = gaussian(),
                           V = 10,
                           SL.library = c("L1", "L2", "L3"))
#> Error in SL.glm(..., X = L1_data) :
#> formal argument "X" matched by multiple actual arguments
#> Warning in FUN(X[[i]], ...): Error in algorithm L1
#> The Algorithm will be removed from the Super Learner (i.e. given weight 0)
#> Error in SL.glmnet(..., X = L2_data, alpha = 1) :
#> formal argument "X" matched by multiple actual arguments
#> Warning in FUN(X[[i]], ...): Error in algorithm L2
#> The Algorithm will be removed from the Super Learner (i.e. given weight 0)
#> Error in SL.ranger(..., X = L3_data) :
#> formal argument "X" matched by multiple actual arguments
#> Warning in FUN(X[[i]], ...): Error in algorithm L3
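A possible reading of these errors, as a minimal sketch in base R rather than anything specific to SuperLearner: CV.SuperLearner appears to pass X to each wrapper by name, so X travels through ... and then collides with the X the wrapper adds itself. The names fake_learner, broken, make_subset_learner, and L1_fixed below are hypothetical and exist only for illustration. The sketch assumes wrappers are called with named Y, X, and newX arguments, as the SL.glm signature suggests.

```r
# Hypothetical stand-in for an SL.* wrapper; it only reports which columns it saw.
fake_learner <- function(Y, X, newX, ...) {
  list(used = colnames(X), used_new = colnames(newX))
}

# Same pattern as the wrappers in the question: X arrives through ...
# and is then supplied a second time.
broken <- function(...) fake_learner(..., X = data.frame(a = 1))

demo_x <- data.frame(a = 1:3, b = 1:3, c = 1:3, d = 1:3)

# Capture the error message instead of stopping the script.
res <- tryCatch(
  broken(Y = 1:3, X = demo_x, newX = demo_x),
  error = function(e) conditionMessage(e)
)
print(res)

# One possible workaround (an assumption, not a documented SuperLearner recipe):
# accept X and newX explicitly and subset them, instead of passing a new X.
make_subset_learner <- function(learner, vars) {
  force(learner)
  force(vars)
  function(Y, X, newX, ...) {
    learner(Y = Y,
            X = X[, vars, drop = FALSE],
            newX = newX[, vars, drop = FALSE],
            ...)
  }
}

L1_fixed <- make_subset_learner(fake_learner, c("a", "b", "c"))
print(L1_fixed(Y = 1:3, X = demo_x, newX = demo_x, family = "gaussian"))
```

If that assumption holds, something like L1 <- make_subset_learner(SL.glm, c("a", "b", "c")) might replace the wrappers above, with X = data_x passed to CV.SuperLearner as usual. This has not been checked against every SL.* wrapper.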
Created on 2020-11-09 by the reprex package (v0.3.0)
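Another route that may be simpler than custom wrappers is SuperLearner's screening mechanism: an entry of SL.library can pair a learner with a screening function, and the screening function returns a logical vector marking which columns of X that learner sees. A sketch under that assumption follows. make_screen, screen.abc, screen.bde, screen.acde, and sl_library are hypothetical names, and the CV.SuperLearner call is left commented out so the block runs in base R.

```r
# Hypothetical helper: build a screening function that keeps a fixed set of columns.
# Extra arguments (Y, family, obsWeights, id) are accepted and ignored via ...
make_screen <- function(vars) {
  force(vars)
  function(X, ...) colnames(X) %in% vars
}

screen.abc  <- make_screen(c("a", "b", "c"))
screen.bde  <- make_screen(c("b", "d", "e"))
screen.acde <- make_screen(c("a", "c", "d", "e"))

# Each entry pairs a learner with the screen that restricts its features.
sl_library <- list(
  c("SL.glm", "screen.abc"),
  c("SL.glmnet", "screen.bde"),
  c("SL.ranger", "screen.acde")
)

# Quick check of what one screen selects on a toy data frame.
demo_x <- data.frame(a = 1, b = 2, c = 3, d = 4, e = 5)
print(screen.bde(X = demo_x))

# Intended use with the data from the question (not run here):
# library(SuperLearner)
# cv.SL_3 <- CV.SuperLearner(Y = data$y, X = data_x, family = gaussian(),
#                            V = 10, SL.library = sl_library)
```

With this layout X is passed once, to CV.SuperLearner, and each learner only receives the columns its screen marks TRUE, so the double-X conflict should not arise.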