I intend to use caret::sbf to do univariate feature selection. My input is a data frame with multiple variables (its columns), a list of candidate features, and a label (a categorical variable). After reading the caret package documentation, I tried using sbf and sbfControl to do feature selection, but I ran into the error below:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels

Can anyone point out how to resolve this error? What is the correct way to use caret::sbf for feature selection?

reproducible example:

The data I used as input is available as a reproducible example in a public gist.

my current attempt:

library(caret)
library(e1071)
library(randomForest)

df <- read.csv("df.csv", header = TRUE)

sbfCtrl <- sbfControl(method = 'cv', number = 10, returnResamp = 'final', functions = caretFuncs, saveDetails = TRUE)

model <- sbf(form= ventil_status~ .,
                 data= df,
                 methods='knn',
                 trControl=trainControl(method = 'cv', classProbs = TRUE),
                 tuneGrid=data.frame(k=1:10),
                 sbfControl=sbfControl(functions = sbfCtrl,
                                       methods='repeatedcv', number = 10, repeats = 10))

print(model)
print(model$fit$results)

> model <- sbf(ventil_status~ ., data=df, sizes=c(1,5,10,20),
+              method= 'knn', trControl=trainControl(method = 'cv', classProbs = TRUE),
+              tuneGrid = data.frame(k=1:10),
+              sbfControl=sbfCtrl)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels

I googled this error but still could not get past it. Any idea how to make the above code work? What is the correct way to do filter-based selection with caret::sbf?

What I want is an output data frame that contains the selected features with their p-values attached. Here is my attempt:

# drop the subject identifier column
newdf <- df[ , -which(names(df) %in% c("subject"))]
# Wilcoxon test per column between the two ventil_status groups;
# NA is returned for any column where the test warns or fails
p_value_vector <- sapply(names(newdf), function(i)
    tryCatch(
        wilcox.test(newdf[newdf$ventil_status %in% "0", i],
                    newdf[newdf$ventil_status %in% "1", i])$p.value,
        warning = function(w) NA,
        error = function(e) NA
    )
)

expected output:

I expect an output data frame of selected features, where the p-value returned by wilcox.test is attached to each corresponding feature. Any idea how to make this happen in R? How can I run feature selection with caret::sbf properly?

here is my R sessioninfo:

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggpubr_0.2.5        magrittr_1.5        reshape2_1.4.3     
 [4] forcats_0.5.0       purrr_0.3.3         readr_1.3.1        
 [7] tibble_2.1.3        tidyverse_1.3.0     stringr_1.4.0      
[10] dplyr_0.8.5         scales_1.1.0        tidyr_1.0.2        
[13] aws.s3_0.3.20       randomForest_4.6-14 e1071_1.7-3        
[16] mlbench_2.1-1       caret_6.0-86        ggplot2_3.3.0      
[19] lattice_0.20-38  
StupidWolf
beyond_inifinity
  • @StupidWolf I am trying to understand your answer, but one thing struck me: in `knnSBF$filter <- function(score, x, y) { score <= 0.05 }` the arguments `x` and `y` are not used. Is `model1$optVariables` still considered the set of filtered features? Can you elaborate on this point a bit more? Thanks – beyond_inifinity Apr 01 '20 at 14:28
  • That is the way sbf is written. If you look at the underlying code, https://github.com/topepo/caret/blob/master/pkg/caret/R/selectByFilter.R , there is this line: `retained <- sbfControl$functions$filter(scores, x, y)` – StupidWolf Apr 01 '20 at 14:36
  • @StupidWolf Instead of using `p_value_vector`, can't we use `knnSBF$score` for the filtered features? How can I access the p-values of `model1$optVariables` directly, since `wilcox.test` returns a p-value? – beyond_inifinity Apr 01 '20 at 14:38
  • How do you want to use it? – StupidWolf Apr 01 '20 at 14:40
  • @StupidWolf I am expecting the filtered features with their corresponding p-values in a data frame. Using `p_value_vector` does not make much sense to me. Why don't we keep the filtered features with their p-values, instead of computing `p_value_vector` outside the filter step? – beyond_inifinity Apr 01 '20 at 14:43
  • caret doesn't return the p-values; you have to calculate them yourself. The p_value_vector is a way to calculate them for all variables. Then you subset to the variables that are used in the final model and place them in a data.frame, as I explained before – StupidWolf Apr 01 '20 at 14:46
  • data.frame(model1$optVariables,p_value_vector[model1$optVariables]) – StupidWolf Apr 01 '20 at 14:46

1 Answer

To use sbf, you can start from caretSBF and then define the score and filter functions the way you want them:

library(mlbench)
library(caret)

knnSBF = caretSBF
knnSBF$summary <- twoClassSummary
knnSBF$score <- function(x, y) {
    wilcox.test(x ~ y)$p.value
}
knnSBF$filter <- function(score, x, y) {
     score <= 0.05
}
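
To see what these two functions do before handing them to sbf, they can be called by hand on a single predictor. With the default univariate filtering, caret applies the score function to one predictor at a time and passes the resulting scores to the filter. The sketch below is only an illustration: it uses the built-in mtcars data rather than Sonar or the asker's data, with `am` standing in for a two-class label, and the names `score_fun` and `filter_fun` are made up for this example.

```r
# Stand-in data: mtcars ships with R; 'am' plays the role of the class label.
y <- factor(mtcars$am, labels = c("auto", "manual"))
x <- mtcars$mpg  # one numeric predictor

# Same logic as knnSBF$score above: Wilcoxon rank-sum p-value for one predictor.
# exact = FALSE is added here only to avoid the ties warning on this small data set.
score_fun <- function(x, y) {
    wilcox.test(x ~ y, exact = FALSE)$p.value
}

# Same logic as knnSBF$filter above: keep the predictor if its score is <= 0.05.
# x and y are part of the expected signature but are not needed for a plain threshold.
filter_fun <- function(score, x, y) {
    score <= 0.05
}

p <- score_fun(x, y)          # a single p-value
keep <- filter_fun(p, x, y)   # TRUE if the predictor would be retained
print(p)
print(keep)
```

Changing the threshold inside the filter is then the only thing needed to make the selection stricter or looser.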

Then you define the training parameters and sbf parameters:

sbfCtrl <- sbfControl(method = "cv", number = 3,
                      functions = knnSBF, saveDetails = TRUE)

trn_grid <- expand.grid(k=c(2,6,10))

trCtrl <-  trainControl(method = "cv",number = 3,
                        classProbs = TRUE,verboseIter = TRUE)

Then run the train:

data(Sonar)
y = Sonar$Class
x = Sonar[,-ncol(Sonar)]
set.seed(111)
model1 <- sbf(x,y,trControl = trCtrl,
                sbfControl = sbfCtrl,
                method = "knn",
                tuneGrid = trn_grid)

model1$variables
$selectedVars
 [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V8"  "V9"  "V10" "V11" "V12" "V13"
[13] "V14" "V20" "V21" "V22" "V36" "V37" "V42" "V43" "V44" "V45" "V46" "V47"
[25] "V48" "V49" "V50" "V51" "V52" "V54" "V58"

$selectedVars
 [1] "V4"  "V5"  "V6"  "V9"  "V10" "V11" "V12" "V13" "V14" "V20" "V21" "V22"
[13] "V28" "V31" "V34" "V35" "V36" "V37" "V43" "V44" "V45" "V46" "V47" "V48"
[25] "V49" "V51" "V52"

$selectedVars
 [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10" "V11" "V12"
[13] "V13" "V14" "V21" "V22" "V23" "V34" "V35" "V36" "V37" "V43" "V44" "V45"
[25] "V46" "V47" "V48" "V49" "V50" "V51" "V52" "V53" "V56" "V58"

As far as I know, sbf does not return the p-values, although I might be wrong. To calculate the p-values yourself, using the example above:

p_value_vector <- apply(x,2,function(i)wilcox.test(i~y)$p.value)
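
To get the retained features and their p-values into one data frame, the named vector can be subset by feature name. This is a minimal sketch under a few assumptions: mtcars stands in for the real data, `am` is the stand-in label, and the hypothetical `selected` vector is derived from the same 0.05 threshold. With a fitted sbf object you would use `model1$optVariables` in its place.

```r
# Stand-in data: a few numeric mtcars columns as predictors, 'am' as the label.
y <- factor(mtcars$am)
x <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]

# One Wilcoxon p-value per column; the result is a vector named by column.
# exact = FALSE only silences the ties warning on this small data set.
p_value_vector <- apply(x, 2, function(i) wilcox.test(i ~ y, exact = FALSE)$p.value)

# Hypothetical stand-in for model1$optVariables: features passing the threshold.
selected <- names(p_value_vector)[p_value_vector <= 0.05]

# Pair each selected feature with its p-value and sort by p-value.
result <- data.frame(feature = selected,
                     p_value = unname(p_value_vector[selected]),
                     stringsAsFactors = FALSE)
result <- result[order(result$p_value), ]
print(result)
```

Note that these p-values are computed on the full data, while sbf recomputes the filter inside each resample, so the two sets of features do not have to match exactly.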
beyond_inifinity
StupidWolf
  • Where can I see the features selected by the p-value threshold? – beyond_inifinity Mar 31 '20 at 22:02
  • Which one are you referring to, sbf or the p_value_vector? – StupidWolf Mar 31 '20 at 22:05
  • I mean in `sbf`, where you used `knnSBF$filter`. I think `knnSBF$filter` will be used for feature filtering based on a customized p-value threshold, am I right? Why do we use `twoClassSummary`? Thanks for your help! – beyond_inifinity Mar 31 '20 at 22:08
  • Yes, that's correct. In the example above, I also tuned over 3 values of k, and if you look at model1$fit, you see that the best model is chosen for k = 2, and it has 35 variables, which can be found in model1$optVariables – StupidWolf Mar 31 '20 at 22:12
  • There is something I am trying to figure out: in the example above, model1$variables is "a list of variable names that survived the filter at each resampling iteration" according to the vignette – StupidWolf Mar 31 '20 at 22:17
  • OK, I see now. This is not used for the final model; it is more a reflection of the uncertainty of the filtering, based on resampling your data – StupidWolf Mar 31 '20 at 22:17
  • I want to export the selected features, `model1$optVariables`, with their p-values as a data frame. How can I do that? Thanks again for your effort – beyond_inifinity Mar 31 '20 at 22:38
  • Something like data.frame(model1$optVariables,p_value_vector[model1$optVariables]), using the p_value_vector from above – StupidWolf Mar 31 '20 at 22:52