
I am writing a Naive Bayes classifier for performing indoor room localization from WiFi signal strength. So far it is working well, but I have some questions about missing features. This occurs frequently because I use WiFi signals, and WiFi access points are simply not available everywhere.

Question 1: Suppose I have two classes, Apple and Banana, and I want to classify test instance T1 as below.

[Figure: training classes and test instance. Class 'Apple' has features 1, 2, 3; class 'Banana' has features 2, 3, 4; test instance T1 has features 1, 3, 4.]

I fully understand how the Naive Bayes classifier works. Below is the formula I am using from Wikipedia's article on the classifier. I am using uniform prior probabilities P(C=c), so I omit that term in my implementation.

classify(f_1, ..., f_n) = argmax_c P(C = c) * prod_i P(F_i = f_i | C = c)

Now, when I compute the right-hand side of the equation and loop over all the class-conditional feature probabilities, which set of features do I use? Test instance T1 uses features 1, 3, and 4, but the two classes do not have all of these features. So when I compute the probability product, I see several choices for which features to loop over:

  1. Loop over the union of all features from training, namely features 1, 2, 3, 4. Since the test instance T1 does not have feature 2, use an artificial tiny probability for it.
  2. Loop over only features of the test instance, namely 1, 3, and 4.
  3. Loop over the features available for each class. To compute class-conditional probability for 'Apple', I would use features 1, 2, and 3, and for 'Banana', I would use 2, 3, and 4.

Which of the above should I use?

Question 2: Let's say I want to classify test instance T2, where T2 has a feature not found in either class. I am using log probabilities to help eliminate underflow, but I am not sure of the details of the loop. I am doing something like this (in Java-like pseudocode):

Double bestLogProbability = -100000.0;
ClassLabel bestClassLabel = null;

for (ClassLabel classLabel : allClassLabels)
{
    // Sum of log class-conditional feature probabilities for this class.
    Double logProbabilitySum = 0.0;

    for (Feature feature : allFeatures)
    {
        Double logProbability = getLogProbability(classLabel, feature);

        // Skip features for which this class has no probability estimate.
        if (logProbability != null)
        {
            logProbabilitySum += logProbability;
        }
    }

    // Keep the class with the highest total log-probability.
    if (bestLogProbability < logProbabilitySum)
    {
        bestLogProbability = logProbabilitySum;
        bestClassLabel = classLabel;
    }
}

The problem is that if none of the classes have the test instance's features (feature 5 in the example), then logProbabilitySum will remain 0.0, resulting in a bestLogProbability of 0.0, or linear probability of 1.0, which is clearly wrong. What's a better way to handle this?

stackoverflowuser2010

2 Answers


For the Naive Bayes classifier, the right-hand side of your equation should iterate over all attributes. If you have attributes that are sparsely populated, the usual way to handle that is with an m-estimate of the probability, which uses an equivalent sample size to calculate your probabilities. This will prevent the class-conditional probabilities from becoming zero when your training data have a missing attribute value. Do a web search for those two terms ("m-estimate" and "equivalent sample size") and you will find numerous descriptions of the formula. A good reference text that describes this is Machine Learning by Tom Mitchell. The basic formula is

P_i = (n_i + m*p_i) / (n + m)

n_i is the number of training instances where the attribute has value f_i, n is the number of training instances (with the current classification), m is the equivalent sample size, and p_i is the prior probability for f_i. If you set m=0, this just reverts to the standard probability values (which may be zero, for missing attribute values). As m becomes very large, P_i approaches p_i (i.e., the probability is dominated by the prior probability). If you don't have a prior probability to use, just make it 1/k, where k is the number of attribute values.
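For concreteness, here is what that estimate looks like in Java-like code (to match your pseudocode; the method name and parameter names are just illustrative):

// m-estimate of P(F_i = f_i | C = c).
// ni: count of training instances of class c where the attribute has value f_i
// n:  total count of training instances of class c
// m:  equivalent sample size (m = 0 gives the raw relative frequency n_i / n)
// pi: prior probability for f_i (e.g. 1.0 / k for k attribute values)
static double mEstimate(int ni, int n, int m, double pi)
{
    return (ni + m * pi) / (n + m);
}

With m > 0 this never returns zero, so an unseen attribute value shrinks the class-conditional product instead of forcing it to zero.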

If you use this approach, then for your instance T2, which has no attributes present in the training data, the result will be whichever class occurs most often in the training data. This makes sense since there is no relevant information in the training data by which you could make a better decision.

bogatron
  • Thanks. The m-estimate approach is for discrete data, right? Similar to Laplace smoothing? My problem is that the features are all continuous, and I am using a gaussian PDF to compute the likelihood probability densities. Is there an equivalent of m-estimation for continuous features? – stackoverflowuser2010 Nov 20 '12 at 18:30
  • Ah, I didn't realize you were using PDFs rather than computed probabilities. Yes, the m-estimate is used with discrete data. You could probably still use the m-estimate if you replace n_i in the formula with n_i*pdf_i, where pdf_i is the Gaussian pdf value that you compute for the given attribute value. Then, for non-zero m, it would prevent your posterior probability from becoming zero and you can still use the value of m to balance between your computed probability and an assumed prior. – bogatron Nov 20 '12 at 18:48
  • I am not familiar with m-estimates, but I am thinking of a different approach similar to Laplace add-1 smoothing: just give a small probability to missing features so that the product is non-zero. For test instance T1 in my example, feature 2 would be given a tiny probability, like 0.000001 or something. I've implemented it, and it seems to work well. But is it a sound approach? – stackoverflowuser2010 Nov 21 '12 at 20:57
  • That is basically what the m-estimate accomplishes but it also guarantees that the default minimum probability won't be greater than an actual probability for a non-zero count. If your default probability is less than any non-zero probability and doesn't cause rounding to zero of the class-conditional probability, then it is a reasonable approach. – bogatron Nov 25 '12 at 05:34
  • Please see my answer for more details, but the generalisation of the m-estimate to arbitrary valued variables is to understand it in terms of a prior (the m estimate is an estimator based on the posterior mean of the parameter). Just using a small number may work in this specific instance, but it's very shaky in theory (*how* small should depend on all manner of things, like sample sizes, prior beliefs, etc). – Ben Allison Nov 26 '12 at 13:26

I would be tempted to simply ignore any features not found in all classes at training. If you choose to do otherwise, you're essentially hallucinating data and then treating it equally to data that really existed in the classification step. So my simple answer to question 1 would be to make the decision on the basis of feature 3 alone (you just don't have enough information to do anything else). This is part of what the m-estimate mentioned by @bogatron is doing.

There's a more complicated answer to this for classes in training where certain features are missing, but it would take a good deal more work. The m-estimate is really a point estimate of the posterior distribution over p_i (which in your case is mu_i, sigma_i) given your training data, where the posterior combines a prior on p_i with the likelihood function p(data | p_i) (whose maximum on its own would give the fraction n_i / n). In the case where you observe no data points, you can essentially revert to the prior for the predictive distribution of that feature.

Now, how do you go about estimating that prior? Well, if the number of classes missing a value for some feature is small relative to the number of classes that do have observations for it, you can infer the parameters of the prior from the classes which do have data, and take the predictive distribution for the classes missing data to be simply this prior (for the classes having data, your predictive distribution is the posterior). A useful pointer: since you seem to be assuming your data are normally distributed (or at least characterised by their mean and standard deviation), the prior on the mean should also be normal for the sake of conjugacy. I'd probably want to avoid doing inference about the prior distribution of your standard deviations, since this is a bit fiddly if you're new to it.
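For concreteness, here is a sketch of that conjugate update in Java (illustrative names, not from any library; it assumes the feature's standard deviation sigma is fixed rather than inferred, which is the simplification suggested above):

// Posterior over the mean mu of a Gaussian feature for one class, assuming
// the feature's standard deviation sigma is known/fixed. The prior on mu is
// Normal(priorMean, priorSd^2); with zero observations the posterior equals
// the prior, i.e. you fall back to the prior for that class.
static double[] posteriorOverMean(double priorMean, double priorSd,
                                  double sigma, double[] observations)
{
    double sum = 0.0;
    for (double x : observations) sum += x;

    double priorPrecision = 1.0 / (priorSd * priorSd);
    double dataPrecision = observations.length / (sigma * sigma);

    double postPrecision = priorPrecision + dataPrecision;
    double postMean = (priorMean * priorPrecision + sum / (sigma * sigma)) / postPrecision;

    // Return {posterior mean, posterior standard deviation} of mu.
    return new double[] { postMean, Math.sqrt(1.0 / postPrecision) };
}

You would then use this (or the full predictive distribution it implies) as the estimate of mu for that class when scoring test instances.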

Note, however, that this only makes sense if you have enough classes with observations for that feature that the fraction of missing values is small. In particular, in your example you only have a single class with observations, so the best you could possibly do for feature 1 in class "Banana" would be to assume that uncertainty about its mu_1 is represented by a distribution centered around "Apple"'s mu_1 with some arbitrary variance. Or you could assume their mus were equal, in which case it would have no effect on the decision and you might as well have ignored it!

Thus, unfortunately, the answer to your Question 2 is that your code is doing the correct thing. If your new test instance has only features that have never been observed in training, how could you hope to pick a class for it? You can do no better than choose according to the prior.

Ben Allison
  • Thanks for the explanation. Regarding the priors, I am currently assuming equiprobable priors. In fact, the Wikipedia article on Naive Bayes classification says: "A class' prior may be calculated by assuming equiprobable class, or by calculating an estimate for the class probability from the training set". Would it be ok to make this equiprobable assumption? If not, then it seems fairly arbitrary to calculate the prior as #Apple/#allClasses. What if the training data happened to have 1000 apples and 10 bananas because the grocery store happened to have a sale on bananas and they were all gone? – stackoverflowuser2010 Nov 26 '12 at 18:33
  • This is a point of frequent confusion---I'm not talking about the class prior, rather the prior on the parameters for each feature. If your feature is normally distributed, it has parameters mu_i (the mean) and sigma_i (the standard dev). I suggested a prior on mu_i, which encodes the idea that you have expectations about what mu_i will look like *before* you look at examples. If you have no examples to look at, then fall back to the prior. A good textbook on Bayesian methods will clarify: David Mackay's book, available online: http://www.inference.phy.cam.ac.uk/mackay/itila/book.html – Ben Allison Nov 27 '12 at 10:04