
I want to use a naive Bayes classifier to make some predictions. So far I can make predictions with the following (sample) code in R:

library(klaR)
library(caret)


# simulated sample data: a four-level target (Faktor) and four predictors
Faktor <- sample(LETTERS[1:4], 10000, replace = TRUE, prob = c(0.1, 0.2, 0.65, 0.05))
alter <- abs(rnorm(10000, 30, 5))
HF <- abs(rnorm(10000, 1000, 200))
Diffalq <- rnorm(10000)
Geschlecht <- sample(c("Mann", "Frau", "Firma"), 10000, replace = TRUE)
data <- data.frame(Faktor, alter, HF, Diffalq, Geschlecht)

set.seed(5678)
flds <- createFolds(data$Faktor, 10)

# hold out fold 1 as the test set
train <- data[-flds$Fold01, ]
test <- data[flds$Fold01, ]

features <- c("HF", "alter", "Diffalq", "Geschlecht")

# build the model formula from the feature vector
formel <- as.formula(paste("Faktor ~", paste(features, collapse = " + ")))

nb <- NaiveBayes(formel, data = train, usekernel = TRUE)

pred <- predict(nb, test)

test$Prognose <- as.factor(pred$class)
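As a quick sanity check of this baseline (a minimal sketch; it only uses caret's confusionMatrix() on the objects defined above), the hold-out predictions can be compared against the true classes:

# compare the hold-out predictions with the true classes in the test fold
confusionMatrix(test$Prognose, test$Faktor)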

Now I want to improve this model by feature selection. My real data has about 100 features. So my question is: what would be the best way to select the most important features for naive Bayes classification? Is there any paper for reference?

I tried the following line of code, but unfortunately it did not work:

rfe(train[, 2:5], train[, 1], sizes = 1:4,
    rfeControl = rfeControl(functions = ldaFuncs, method = "cv"))

EDIT: It gives me the following error message:

Error in { :   task 1 failed - "non-numeric argument to binary operator"
Calls: rfe ... rfe.default -> nominalRfeWorkflow -> %op% -> <Anonymous>

The original message is in German ("nicht-numerisches Argument für binären Operator"); the translation is above. You can reproduce it by running the code on your machine.

How can I adjust the rfe() call to perform recursive feature elimination?

  • This question appears to be off-topic because it is about variable selection for a specific statistical model; it is not a specific programming question. You might consider posting to [stats.se] instead. – MrFlick Jun 24 '14 at 18:02
  • I partly disagree, MrFlick, because it is a two-way question. Because I do not want to violate the rules of this site, I limit my question to the following extent: how do I have to adjust `ref()` to get my piece of code above to work? –  Jun 24 '14 at 18:20
  • EDIT: I mean `rfe(..)`, sorry! –  Jun 24 '14 at 18:27
  • @ewuenob Then please edit the original question to make your specific question very clear. Don't ask for things like paper references. Also, it's never enough to say something "does not work". If you get an error message, you should include that. If it does not work the way you expect, describe what you thought would happen and what actually did happen. – MrFlick Jun 24 '14 at 18:34
  • @MrFlick... DONE. Because the error message is in German, I think it is best practice to run my code and see what error occurs. –  Jun 24 '14 at 19:10

1 Answer


This error appears to be due to the ldaFuncs: apparently they do not like factors when given matrix input. The main problem can be re-created with your test data using

mm <- ldaFuncs$fit(train[2:5], train[, 1])
ldaFuncs$pred(mm, train[2:5])
# Error in FUN(x, aperm(array(STATS, dims[perm]), order(perm)), ...) : 
#   non-numeric argument to binary operator

And this only seems to happen if you include the factor variable. For example

mm <- ldaFuncs$fit(train[2:4], train[, 1])
ldaFuncs$pred(mm, train[2:4])

does not return the same error (and appears to work correctly). Again, this only appears to be a problem when you use the matrix syntax. If you use the formula/data syntax, you don't have the same problem. For example

mm <- ldaFuncs$fit(Faktor ~ alter + HF + Diffalq + Geschlecht, train)
ldaFuncs$pred(mm, train[2:5])

appears to work as expected. This means you have a few different options. Either you can use the rfe() formula syntax like

rfe(Faktor ~ alter + HF + Diffalq + Geschlecht, train, sizes=1:4,
    rfeControl =  rfeControl(functions = ldaFuncs, method = "cv"))
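Since your end goal is a naive Bayes model, one further option (a sketch I have not run on your data) would be to swap ldaFuncs for caret's built-in nbFuncs, which wraps klaR::NaiveBayes, so the elimination is scored by the same model family you will ultimately fit:

# same RFE call, but with features ranked by naive Bayes instead of LDA
rfe(Faktor ~ alter + HF + Diffalq + Geschlecht, train, sizes = 1:4,
    rfeControl = rfeControl(functions = nbFuncs, method = "cv"))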

Or you could expand the dummy variables yourself with something like

# use data.frame() rather than cbind() so Faktor stays a factor
# (cbind() would coerce everything, including the outcome, to numeric)
train.ex <- data.frame(Faktor = train[, 1], model.matrix(~ . - Faktor, train)[, -1])
rfe(train.ex[, 2:6], train.ex[, 1], ...)

But this doesn't remember which variables belong to the same factor, so it's not ideal.
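To see what gets lost, you can inspect the expanded column names (output shown for the sample data above; the exact names depend on your factor levels):

# Geschlecht expands into separate indicator columns that rfe() treats independently
colnames(model.matrix(~ . - Faktor, train))[-1]
# e.g. "alter" "HF" "Diffalq" "GeschlechtFrau" "GeschlechtMann"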

MrFlick
  • Thank you very much for this detailed answer; this is absolutely sufficient for my purposes. Just because I am curious, I want to ask if there is a way to perform some kind of exhaustive search over all possible combinations of features? I know this is quite a lot (2^n possible combinations if we have n features), but with small feature sets this may be a way to go. –  Jun 25 '14 at 04:31
  • @ewuenhob I really never use these functions myself so I can't say. – MrFlick Jun 25 '14 at 04:38
  • That is no problem; your answer helped me a lot! Maybe I will find another way. Since exhaustive search is not the key topic of this post, I may open a new post about it. Thank you very much! –  Jun 25 '14 at 04:52