I have a data set whose target variable has some classes with only a few instances. I know that cross-validation might not be the best approach here, but I wonder how Weka handles this when using stratified k-fold cross-validation. I tried to find the actual code here: http://grepcode.com/file/repo1.maven.org/maven2/nz.ac.waikato.cms.weka/weka-dev/3.7.6/weka/filters/supervised/instance/StratifiedRemoveFolds.java/ but could not locate the relevant part.

Example: the target variable has 3 classes, of which 2 have 50 instances each and 1 has only a single instance. Stratified sampling tries to keep the class distribution the same in every fold, which is impossible in this case with 10 folds.

This might be a stats.stackexchange question; however, I am not interested in a statistical answer, just in how the code works. For example, using R with RWeka:

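library(RWeka)
# Rows 1-101 of iris: 50 setosa, 50 versicolor and a single virginica
iris_input <- iris[1:101, ]
iris_fit <- J48(Species ~ ., data = iris_input, na.action = NULL)
# 10-fold cross-validation of the fitted classifier
evaluate_Weka_classifier(iris_fit, numFolds = 10)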

I hope my question is clear.

It might be related to "R: Cross validation on a dataset with factors".

Freddy
  • I don't have time to try this one out today, but my suspicion is that the class with only 1 instance will be in only one fold. Otherwise it wouldn't really be creating folds. This [link](http://weka.wikispaces.com/Generating+cross-validation+folds+(Java+approach)) shows you how you can create your own folds with Java, but not sure if you can transfer that over to RWeka. If you could, that would show you exactly what is in each fold. – Walter Nov 22 '13 at 16:26
  • @Walter Yeah, sometimes I wish I had just started building my model in Java, which opens up a lot of options. But I am working with R now, which is also nice to learn. And I always have trouble reading in a lot of data in Java, which is no problem in R :) – Freddy Nov 22 '13 at 18:46
  • @Walter I think you are correct. Looking at the confusion matrix, evaluate_Weka_classifier(iris_fit,numFolds=10)$confusionMatrix, the single instance was misclassified only once, so across all folds it was present in a test set only one time. – Freddy Nov 23 '13 at 09:53

1 Answer

Looking at the confusion matrix, the single instance of the rare class is counted, and misclassified, exactly once. This means it occurred in only one of the test folds.

evaluate_Weka_classifier(iris_fit,numFolds=10)$confusionMatrix
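
To illustrate why the rare instance shows up exactly once, here is a small self-contained Java sketch. It is not Weka's code and uses no Weka classes. It assumes, as a simplification, that stratification amounts to grouping the instances by class and then dealing them out to the folds one at a time; the class name StratifiedFoldsSketch and its methods are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StratifiedFoldsSketch {

    // Returns, for each fold, the class labels of the instances in its test set.
    // Assumption: stratify = sort by class, then deal out round-robin.
    public static List<List<String>> assignFolds(String[] labels, int numFolds) {
        String[] sorted = labels.clone();
        Arrays.sort(sorted); // puts instances of the same class next to each other

        List<List<String>> folds = new ArrayList<>();
        for (int f = 0; f < numFolds; f++) {
            folds.add(new ArrayList<>());
        }
        // Deal the class-sorted instances out one by one, like cards.
        for (int i = 0; i < sorted.length; i++) {
            folds.get(i % numFolds).add(sorted[i]);
        }
        return folds;
    }

    // Counts how many folds contain at least one instance of the given class.
    public static int countFoldsContaining(List<List<String>> folds, String label) {
        int count = 0;
        for (List<String> fold : folds) {
            if (fold.contains(label)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Same shape as iris[1:101, ]: 50 setosa, 50 versicolor, 1 virginica.
        String[] labels = new String[101];
        Arrays.fill(labels, 0, 50, "setosa");
        Arrays.fill(labels, 50, 100, "versicolor");
        labels[100] = "virginica";

        List<List<String>> folds = assignFolds(labels, 10);
        for (int f = 0; f < folds.size(); f++) {
            System.out.println("fold " + f + ": size " + folds.get(f).size()
                + ", contains virginica: " + folds.get(f).contains("virginica"));
        }
        System.out.println("folds containing virginica: "
            + countFoldsContaining(folds, "virginica"));
    }
}
```

Under this assumption every instance lands in exactly one test fold, so a class with a single instance cannot be spread across folds. When that fold is held out, the training data contains no example of the class at all, which is consistent with the one misclassification seen in the confusion matrix.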

Freddy