8

I have a classification problem that I'd like to address with a machine learning algorithm (probably Bayes or Markovian; the question is independent of the classifier used). Given a number of training instances, I'm looking for a way to measure the performance of an implemented classifier while taking the overfitting problem into account.

That is: given N[1..100] training samples, if I run the training algorithm on every one of the samples and then use those very same samples to measure fitness, the evaluation may fall into an overfitting trap: the classifier will know the exact answers for the training instances without having much predictive power, rendering the fitness results useless.

An obvious solution would be separating the hand-tagged samples into training and test samples; I'd like to learn about methods for selecting statistically significant samples for training.
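To make the split concrete, here is a minimal sketch of a random hold-out split using only the Python standard library. The function name `train_test_split`, the 20% test fraction, and the toy `(feature, label)` samples are assumptions for illustration, not part of any particular library.

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and split off a held-out test set.

    Hypothetical helper: returns (train, test) lists.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(samples)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Toy hand-tagged data (assumed): 100 (feature, label) pairs.
samples = [(i, i % 2) for i in range(100)]
train, test = train_test_split(samples, test_fraction=0.2)
```

The classifier would be trained on `train` only and scored on `test` only, so the fitness figure reflects samples it has never seen.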

White papers, book pointers, and PDFs much appreciated!

Jon Seigel
Silver Dragon

2 Answers

14

You could use 10-fold cross-validation for this. I believe it's a pretty standard approach for evaluating the performance of classification algorithms.

The basic idea is to divide your labeled samples into 10 subsets. Use one subset as test data and the remaining nine as training data. Repeat this so that each subset serves as the test set exactly once, then average the performance over the 10 runs.
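The procedure above can be sketched in plain Python roughly as follows. This is only an illustration under stated assumptions: `k_fold_indices`, `cross_validate`, and `train_majority` are hypothetical names, accuracy is used as the performance measure, and the majority-class "classifier" is a placeholder for whatever Bayes or Markov model you actually train.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle indices 0..n_samples-1 and deal them into k nearly equal folds."""
    if k < 2 or k > n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(samples, labels, train_fn, k=10, seed=0):
    """Return the mean accuracy over k folds.

    train_fn(train_samples, train_labels) must return a function that maps
    one sample to a predicted label.
    """
    accuracies = []
    for test_idx in k_fold_indices(len(samples), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        # Train on the other k-1 folds only.
        predict = train_fn([samples[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        # Score on the held-out fold only.
        correct = sum(predict(samples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)

def train_majority(train_samples, train_labels):
    """Placeholder classifier: always predicts the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda sample: most_common

# Toy data (assumed): 70 samples of class 1 and 30 of class 0.
samples = list(range(100))
labels = [1 if x < 70 else 0 for x in samples]
mean_accuracy = cross_validate(samples, labels, train_majority, k=10)
```

Because every sample is held out exactly once, no prediction is ever scored against a sample the model was trained on, and all of the labeled data still contributes to the estimate.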

Rockcoder
  • http://en.wikipedia.org/wiki/Root-mean-square_error_of_cross-validation#K-fold_cross-validation (links directly to k-fold cross validation within the wiki article you linked) – JoeCool Jun 12 '09 at 13:33
  • Is this bucket split done over the test data, the training data, or all of the data? – Egalicia Sep 15 '18 at 18:57
2

As Mr. Brownstone said, 10-fold cross-validation is probably the best way to go. I recently had to evaluate the performance of a number of different classifiers, and for this I used Weka, which has an API and a load of tools that allow you to easily test the performance of lots of different classifiers.

Mark Davidson