
I'm trying to improve the accuracy of my WEKA model by applying an unsupervised discretize filter. I need to decide on the number of bins and whether equal-frequency binning should be used. Normally, I would optimize this using a training set.

However, how do I determine the number of bins, and whether equal-frequency binning should be used, when I use cross-validation? My initial idea was to run multiple cross-validation tests and pick the bin size that gives the best classifier accuracy. But isn't it wrong, even with cross-validation, to use the same data both to tune this parameter and to report the accuracy of the model, since that would give me an overfitted model? What would be a correct way of determining the bin sizes?

I also tried the supervised discretize filter to determine the bin sizes, but this results in only a single bin. Does this mean that my data is too random and therefore cannot be discretized into multiple bins?


1 Answer


Yes, regarding your first issue, you are correct in both your idea and your concerns.

What you are trying to do is Parameter Optimization. This term is usually used when you try to optimize the parameters of your classifier, e.g., the number of trees for the Random Forest or the C parameter for SVMs. But you can apply it as well to pre-processing steps and filters.
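In WEKA you can treat the filter as part of the model by wrapping it in a FilteredClassifier; the Discretize filter is then rebuilt on the training portion of every cross-validation fold and never sees the corresponding test portion. Below is a minimal sketch; the data file name and the J48 base classifier are placeholders chosen for illustration, not part of your setup.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeAsModelStep {
    public static void main(String[] args) throws Exception {
        // "my-data.arff" is a placeholder path; the class attribute is assumed to be last.
        Instances data = DataSource.read("my-data.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // The filter becomes part of the classifier, so it is fitted on the
        // training data of each fold only.
        Discretize discretize = new Discretize();
        discretize.setBins(10);                 // parameter you want to optimize
        discretize.setUseEqualFrequency(true);  // parameter you want to optimize

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(discretize);
        fc.setClassifier(new J48());            // example base classifier

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(fc, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}

The two setters are exactly the parameters you would vary during the optimization described below.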

What you have to do in this case is nested cross-validation. (You should check https://stats.stackexchange.com/ for more information on the topic.) It is important that the final classifier, including all pre-processing steps such as binning, never sees the test set, only the training set. This is the outer cross-validation.

For each fold of the outer cross-validation, you need to do an inner cross-validation on the training set to determine the optimal parameters for your model.

I'll try to visualize it with a simple 2-fold cross-validation:

Data set
########################################

Split for outer cross-validation (2-fold)
#################### ####################
training set                     test set

Split for inner cross-validation
########## ##########
training         test

Evaluate parameters
########## ##########
build with  evaluated

bin size  5   acc 70%
bin size 10   acc 80%
bin size 20   acc 75%
...
=> optimal bin size: 10

Outer cross-validation (2-fold)
#################### ####################
training set                     test set
apply bin size 10
train model                evaluate model
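
Translating the diagram into WEKA code, a rough nested cross-validation could look like the sketch below. The candidate bin sizes, the 5 inner folds, and the J48 base classifier are assumptions for illustration only.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.Discretize;

public class NestedCrossValidation {

    // One candidate model: unsupervised discretization with a given bin count,
    // wrapped so the filter is always fitted on training data only.
    static FilteredClassifier candidate(int bins) {
        Discretize discretize = new Discretize();
        discretize.setBins(bins);
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(discretize);
        fc.setClassifier(new J48()); // example base classifier
        return fc;
    }

    public static void main(String[] args) throws Exception {
        // "my-data.arff" is a placeholder; the class attribute is assumed to be last.
        Instances data = DataSource.read("my-data.arff");
        data.setClassIndex(data.numAttributes() - 1);

        int outerFolds = 2;                 // matches the 2-fold example above
        int innerFolds = 5;                 // assumption
        int[] binCandidates = {5, 10, 20};  // assumption
        Random rnd = new Random(1);

        data.randomize(rnd);
        data.stratify(outerFolds);

        double outerAccuracySum = 0;
        for (int outer = 0; outer < outerFolds; outer++) {
            Instances outerTrain = data.trainCV(outerFolds, outer, rnd);
            Instances outerTest = data.testCV(outerFolds, outer);

            // Inner cross-validation: choose the bin size on the training set only.
            int bestBins = binCandidates[0];
            double bestAcc = -1;
            for (int bins : binCandidates) {
                Evaluation innerEval = new Evaluation(outerTrain);
                innerEval.crossValidateModel(candidate(bins), outerTrain,
                        innerFolds, new Random(1));
                if (innerEval.pctCorrect() > bestAcc) {
                    bestAcc = innerEval.pctCorrect();
                    bestBins = bins;
                }
            }

            // Re-train with the chosen bin size on the full outer training set
            // and evaluate once on the untouched outer test set.
            FilteredClassifier finalModel = candidate(bestBins);
            finalModel.buildClassifier(outerTrain);
            Evaluation outerEval = new Evaluation(outerTrain);
            outerEval.evaluateModel(finalModel, outerTest);
            outerAccuracySum += outerEval.pctCorrect();

            System.out.printf("outer fold %d: best bin size = %d, test accuracy = %.2f%%%n",
                    outer, bestBins, outerEval.pctCorrect());
        }
        System.out.printf("mean outer accuracy: %.2f%%%n", outerAccuracySum / outerFolds);
    }
}

Note that the bin size chosen in each outer fold can differ; the number you report is the average accuracy over the outer test sets, not a single "best" bin size.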

Parameter optimization can be computationally very expensive. If you have 3 parameters with 10 possible values each, that makes 10x10x10 = 1000 parameter combinations you need to evaluate for each outer fold.

This is a machine learning topic in its own right: you can do anything here from a naive grid search to evolutionary search, and sometimes heuristics help. But you always need to do some kind of parameter optimization.
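For the naive grid search over both parameters you asked about, the candidate combinations of bin count and equal-frequency binning can simply be enumerated and each one scored with the inner cross-validation from the sketch above. The candidate values here are assumptions, not recommendations.

import java.util.ArrayList;
import java.util.List;

import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeGrid {

    // Enumerate every (bins, equal-frequency) combination to score in the inner CV.
    public static List<FilteredClassifier> candidates() {
        int[] binCounts = {2, 5, 10, 20, 50};      // assumed candidate values
        boolean[] equalFrequency = {false, true};  // equal-width vs. equal-frequency

        List<FilteredClassifier> grid = new ArrayList<>();
        for (int bins : binCounts) {
            for (boolean eqFreq : equalFrequency) {
                Discretize d = new Discretize();
                d.setBins(bins);
                d.setUseEqualFrequency(eqFreq);

                FilteredClassifier fc = new FilteredClassifier();
                fc.setFilter(d);
                fc.setClassifier(new J48()); // example base classifier
                grid.add(fc);
            }
        }
        return grid; // 5 x 2 = 10 candidates per outer fold
    }
}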

As for your second question: This is really hard to tell without seeing your data. But you should post that as a separate question anyway.
