
I've been looking around, and can't seem to find an answer to this question:
If I train a Naive Bayes classifier on some data, and then re-use that same training data as the test data, shouldn't I get 100% classification success? Thanks for reading!

Edit: It seems I've stimulated a discussion above my level of understanding. As such, I don't feel it is up to me to take the role of 'accepting' an answer. However, I am grateful for your input and will read all answers.

OctaveParango

2 Answers


Actually, despite being the accepted answer, @flyingmeatball's answer is (at least partially) wrong in this particular case. It describes a related phenomenon, but clearly not the crucial one for the situation given.

What you describe is a case where you expect your model to have 100% training accuracy, and it does not. This has nothing to do with the data "not expressing the phenomenon enough" - that would explain high generalization error, not high training error.

Less than 100% training accuracy might mean that the data itself is too noisy to model (which is what flyingmeatball suggests), but for the training set this is the case if and only if there are two exactly identical points with different labels. If this is not the case (and it probably is not), the actual "problem" is that the model you selected has some internal bias. In simple terms, think of it as assumptions about the data, or even constraints, which your model will not relax even if the data clearly does not follow them. In particular, Naive Bayes makes two such assumptions:
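The "identical points, different labels" case can be shown in a minimal sketch (the feature values below are made up for illustration): a deterministic classifier maps identical inputs to the same output, so such a pair caps training accuracy below 100% for any model, not just Naive Bayes.

```python
from sklearn.naive_bayes import GaussianNB

# Rows 0 and 1 are identical but carry different labels. A deterministic
# classifier must predict the same class for both, so at least one of them
# is misclassified and 100% training accuracy is impossible.
X = [[0, 0], [0, 0], [1, 1]]
y = [0, 1, 0]

clf = GaussianNB().fit(X, y)
preds = clf.predict(X)
print(preds[0] == preds[1])   # True: identical inputs, identical outputs
print(clf.score(X, y) < 1.0)  # True: training accuracy is capped below 100%
```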

  1. Features are independent, meaning that there is no correlation, no important link between the label and more than a single feature at a time. If your features are wind and temperature, Naive Bayes will assume that it can make good decisions based on temperature by itself, for example assuming that "a good temperature is around 20 degrees", and the same for wind, for example "at most 10 km/h". It will fail to find a relation based on both values, like "it is important for the temperature minus the wind to be at least 30", or anything similar.
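A toy demonstration of this first assumption (the dataset is invented for illustration): in XOR-style data the label depends on both features jointly, while each feature alone carries no information about the label, so Naive Bayes cannot beat chance even on its own training set.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# XOR-style data: label = x1 XOR x2. Looking at either feature on its own,
# both classes have exactly the same distribution, so a model that treats
# features independently has nothing to work with.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 25)
y = X[:, 0] ^ X[:, 1]

clf = GaussianNB().fit(X, y)
print(clf.score(X, y))  # stays at chance level (0.5), even on the training set
```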

  2. It assumes a particular distribution of values for each feature - usually a multinomial or Gaussian distribution. These are nice families of distributions, but lots of features do not follow them. For example, if your feature is "the time at which people buy at my grocery store" (say you treat it as a continuous variable, measured down to the microsecond), you will notice that you have two "peak hours", one in the morning and one in the evening. Naive Bayes will then do a horrible job fitting a single Gaussian, which will have its peak at noon instead! Again, a wrong assumption leads to wrong decisions.
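The grocery-store example can be sketched numerically (the two rush hours below are invented): a single Gaussian fitted to bimodal data puts its mean, and therefore its peak, between the two real modes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented "purchase time" data with two rush hours: around 9:00 and 18:00.
times = np.concatenate([rng.normal(9.0, 0.5, 500),
                        rng.normal(18.0, 0.5, 500)])

# A single Gaussian fit just takes the overall mean and standard deviation,
# so its peak lands in the early afternoon, a time when hardly anyone shops.
mu = times.mean()
print(round(mu, 1))  # roughly 13.5
```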

Why do we make such assumptions then? Well, for many reasons, but one of them is that we care about generalization, not training performance: the assumptions are a way of preventing our model from overfitting too heavily, at the cost of "underfitting" the training set. This also helps deal with noise, simplifies optimization, and has lots of other wonderful side effects :-)

Hope this helps.

lejlot
  • Just to follow up, I thought the only assumptions were that the features were independent? My feature vectors are a set of booleans and I am trying to perform binary classification. I am wondering because: there are probably several cases where two identical vectors/data-points map to a different class/label. – OctaveParango May 10 '16 at 17:52

No. Not all of the variability in the data may be explained by the features you have selected. Imagine you are classifying whether it is a good day to play tennis. Your features are temperature, wind, and precipitation. Those may be good descriptors, but you didn't train on whether there is a parade in town! On parade days the tennis court is blocked off, so even though your features should do a good job of explaining the known data, there are outliers that don't fit the features.

Generally, there is randomness in data that you will not be able to capture 100%.

Updated per comments below

The question is whether training and testing on the same dataset will be 100% accurate, which I think we both agree will not happen (the OP didn't ask what the assumptions of NB were). Here is a sample dataset demonstrating the scenario above:

import pandas as pd
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()

df = pd.DataFrame([[1, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 1], [1, 1, 0]],
                  columns=['hot', 'windy', 'rainy'])
targets = [1, 1, 0, 0, 0]
preds = gnb.fit(df, targets).predict(df)

print(preds)
# [1 1 0 0 1]

Notice that the first case and the last case are identical, but the classifier missed the prediction for the last case. This is because the data at hand does not always perfectly determine the accompanying classification. There are many other assumptions in NB which could also describe cases where it fails (which you excellently pointed out below), but my goal was just to come up with a quick demonstration that the OP would hopefully understand and that would answer the question.

flyingmeatball
    Let's assume a hypothetical 'perfect' feature selection where we train on the space of 'all and only' features that matter. Would Naive Bayes have a 100% accuracy in that case? Just trying to get this topic a bit more interesting! ;) – Marsellus Wallace May 09 '16 at 19:30
  • For any classifier, if you train it with data that 100% describes it you will have 100% accuracy. – flyingmeatball May 09 '16 at 20:03
  • You are wrong. Naive Bayes has underlying assumptions about feature independence and about a particular density family. Even "perfect features" usually will not yield 100% with the NB approach, as they have to be not only "perfect" as such, but also "perfectly aligned with NB assumptions", which is often impossible. Furthermore, this is true for many classifiers, so the sentence about getting 100% accuracy with perfect features is false not only for NB but for many others. In order to get such a property you need **consistency** with the distribution provided and the loss – lejlot May 09 '16 at 20:15
  • While you are technically correct, I think this is now splitting hairs as to what is defined as "perfect" - some would say that the data for a NB model isn't "perfect" if it doesn't have a gaussian distribution. – flyingmeatball May 09 '16 at 20:49
  • Perfect was nicely defined by Gevorg: "we train on the space of 'all and only' features that matter", which has nothing to do with following the a priori assumptions of the model, which is the crucial part of fitting a valid one to your data. Thus even if this is obvious to you, I believe it is always worth specifying when the OP is clearly not an expert in the field. – lejlot May 09 '16 at 21:29