
I'm trying to use OpenCV 3.1's NormalBayesClassifier on a simple problem that I can easily generate training data for. I settled on classifying input numbers as even or odd. Obviously this can be computed directly with 100% accuracy, but the point is to exercise the ML capabilities of OpenCV in order to get familiar with it.

So, my first question is - is there a theoretical reason why NormalBayesClassifier wouldn't be an appropriate model for this problem?

If not, the second question is, why is my error rate so high? cv::ml::StatModel::calcError() is giving me outputs of 30% - 70%.

Third, what's the best way to bring the error rate down?

Here's a minimal, self-contained snippet that demonstrates the issue:

(To be clear, the classification/output should be 0 for an even number and 1 for an odd number).

#include <opencv2/ml.hpp>
#include <ctime>
#include <iomanip>
#include <iostream>

int main() {

   const int numSamples = 1000;
   cv::RNG rng((uint64) time(NULL));

   // construct training sample data

   cv::Mat samples;
   samples.create(numSamples, 1, CV_32FC1);

   for (int i = 0; i < numSamples; i++) {
      samples.at<float>(i) = (int)rng(10000);
   }

   // construct training response data

   cv::Mat responses;
   responses.create(numSamples, 1, CV_32SC1);

   for (int i = 0; i < numSamples; i++) {
      int sample = (int) samples.at<float>(i);
      int response = (sample % 2);
      responses.at<int>(i) = response;
   }

   // train the classifier, holding out 10% of the samples as a test set

   cv::Ptr<cv::ml::TrainData> data = cv::ml::TrainData::create(samples, cv::ml::ROW_SAMPLE, responses);

   data->setTrainTestSplitRatio(.9);

   cv::Ptr<cv::ml::NormalBayesClassifier> classifier = cv::ml::NormalBayesClassifier::create();

   classifier->train(data);

   // measure the error rate on the held-out test set

   float errorRate = classifier->calcError(data, true, cv::noArray());

   std::cout << "Bayes error rate: [" << errorRate << "]" << std::endl;

   // construct prediction inputs
   const int numPredictions = 10;

   cv::Mat predictInputs;
   predictInputs.create(numPredictions, 1, CV_32FC1);

   for (int i = 0; i < numPredictions; i++) {
      predictInputs.at<float>(i) = (int)rng(10000);
   }

   cv::Mat predictOutputs;
   predictOutputs.create(numPredictions, 1, CV_32SC1);

   // run prediction

   classifier->predict(predictInputs, predictOutputs);

   int numCorrect = 0;

   for (int i = 0; i < numPredictions; i++) {
      int input = (int)predictInputs.at<float>(i);
      int output = predictOutputs.at<int>(i);
      bool correct = (input % 2 == output);

      if (correct)
         numCorrect++;

      std::cout << "Input = [" << (int)predictInputs.at<float>(i) << "], " << "predicted output = [" << predictOutputs.at<int>(i) << "], " << "correct = [" << (correct ? "yes" : "no") << "]"  << std::endl;
   }

   float percentCorrect = (float)numCorrect / numPredictions * 100.0f;

   std::cout << "Percent correct = [" << std::fixed << std::setprecision(0) << percentCorrect << "]" << std::endl;
}

Sample run output:

Bayes error rate: [36]
Input = [9150], predicted output = [1], correct = [no]
Input = [3829], predicted output = [0], correct = [no]
Input = [4985], predicted output = [0], correct = [no]
Input = [8113], predicted output = [1], correct = [yes]
Input = [7175], predicted output = [0], correct = [no]
Input = [811], predicted output = [1], correct = [yes]
Input = [699], predicted output = [1], correct = [yes]
Input = [7955], predicted output = [1], correct = [yes]
Input = [8282], predicted output = [1], correct = [no]
Input = [1818], predicted output = [0], correct = [yes]
Percent correct = [50]
  • from the docs: "Normal Bayes Classifier. This simple classification model assumes that feature vectors from each class are normally distributed (though, not necessarily independently distributed). So, the whole data distribution function is assumed to be a Gaussian mixture, one component per class. Using the training data the algorithm estimates mean vectors and covariance matrices for every class, and then it uses them for prediction." Obviously that assumption doesn't hold for odd vs. even. I think most ML classifiers aren't able to classify odd vs. even. – Micka Oct 16 '16 at 14:59
  • Hi @Micka - Thanks for the clarification, but I'm not sure I understand. Can you elaborate a little and describe how that applies to the odd/even problem? – Daniel A. Thompson Oct 16 '16 at 15:04
  • Your feature space only has 1 dimension: the value. NBC assumes that both classes are normally distributed, so ideally you have a fixed point in your feature space for each class, and the samples of that class are distributed around that point. But in odd vs. even your points are regularly distributed (and non-separable in that one dimension, which will be a problem for many other ML classifiers). – Micka Oct 16 '16 at 15:10
  • OK, that makes sense. Is there a more appropriate classifier in OpenCV that would be able to approach this problem? Alternatively, or in addition to that, is there a better example problem that I could try for the Bayes classifier? – Daniel A. Thompson Oct 16 '16 at 15:13
  • Have a look at the first answer of http://stats.stackexchange.com/questions/161189/train-a-neural-network-to-distinguish-between-even-and-odd-numbers. To exercise the Bayes classifier, I would generate random values from a normal distribution and add some noise to them (see the sketch after these comments). – Micka Oct 16 '16 at 15:19
  • OK, thanks. Add your comment as an answer and I'll accept it. – Daniel A. Thompson Oct 16 '16 at 15:26
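Following Micka's suggestion, here is a minimal sketch (my own, not from the thread) of a problem that fits the classifier's assumptions: each class is drawn from its own normal distribution. The class centers of 20 and 50 and the sigma of 5 are arbitrary choices.

#include <opencv2/ml.hpp>
#include <ctime>
#include <iostream>

int main() {
   const int numSamples = 1000;
   cv::RNG rng((uint64) time(NULL));

   cv::Mat samples(numSamples, 1, CV_32FC1);
   cv::Mat responses(numSamples, 1, CV_32SC1);

   for (int i = 0; i < numSamples; i++) {
      int label = (int) rng(2);                                   // pick class 0 or 1
      float center = (label == 0) ? 20.0f : 50.0f;                // arbitrary class centers
      samples.at<float>(i) = center + (float) rng.gaussian(5.0);  // Gaussian noise around the center
      responses.at<int>(i) = label;
   }

   cv::Ptr<cv::ml::TrainData> data = cv::ml::TrainData::create(samples, cv::ml::ROW_SAMPLE, responses);
   data->setTrainTestSplitRatio(.9);

   cv::Ptr<cv::ml::NormalBayesClassifier> classifier = cv::ml::NormalBayesClassifier::create();
   classifier->train(data);

   std::cout << "Bayes error rate: [" << classifier->calcError(data, true, cv::noArray()) << "]" << std::endl;
}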

1 Answer


In your code you provide the algorithm with a single feature: the number itself. That is not enough, unless you provide several examples of the same numbers, multiple times. If you want the learning algorithm to learn something about odd vs. even, you need to think about what kind of features the classifier could use to learn that. Most machine learning techniques require careful feature engineering from you first.

Since you want to experiment with ML, I suggest the following:

  1. For each number, create say 5 features, one to encode each digit. Thus, 5 would be 00005, i.e. f1=0, f2=0, f3=0, f4=0, f5=5, and 11098 would be f1=1, f2=1, f3=0, f4=9, f5=8 (see the sketch after this list).
  2. If you have numbers larger than that, you can keep only the last 5 digits.
  3. Train your classifier
  4. Test with the same encoding. What you'd like is for the classifier to learn that only the last digit matters in determining odd vs. even.
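As a rough illustration of the encoding in steps 1-2 (the helper name encodeDigits is mine, not from the answer), each sample row gets 5 columns, one per digit:

#include <opencv2/core.hpp>

// Fill one row of `samples` with the last 5 digits of `number`,
// most significant first, so column 4 holds f5 (the last digit).
void encodeDigits(cv::Mat& samples, int row, int number) {
   for (int col = 4; col >= 0; col--) {
      samples.at<float>(row, col) = (float)(number % 10); // current lowest digit
      number /= 10;                                       // drop it
   }
}

// Usage: samples becomes numSamples x 5 instead of numSamples x 1, e.g.
//    cv::Mat samples(numSamples, 5, CV_32FC1);
//    encodeDigits(samples, i, (int) rng(10000));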

If you want to play with it more, you could encode the numbers in binary format, which would make it even easier for the classifier to learn what makes a number odd or even.
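A similar sketch for the binary encoding (again, the helper name and the 14-bit width are my assumptions; 14 bits covers values up to 16383, enough for the question's 0-9999 range):

#include <opencv2/core.hpp>

// Fill one row of `samples` with the lowest `bits` bits of `number`,
// least significant bit in the last column, so a single feature
// (the last one) fully determines odd vs. even.
void encodeBinary(cv::Mat& samples, int row, int number, int bits = 14) {
   for (int col = bits - 1; col >= 0; col--) {
      samples.at<float>(row, col) = (float)(number & 1); // current lowest bit
      number >>= 1;                                      // shift it out
   }
}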

– Pascal Soucy