
When trying to classify some text content, I often get results similar to this:

{"category":"SOME_CATEGORY","confidence":NaN}

Apart from the fact that this is not valid JSON (NaN is not allowed), I don't understand what is happening.

If necessary, I can provide intermediate values by attaching a debugger during the computation.

clemp6r

1 Answer


Hmm, so my inkling here is that the issue is in the NBAlgorithm.predict method, and it occurs for the following reason. If you look at PreparedData, the frequency vector is being created using Spark MLlib's HashingTF class. The default size of this vector is 1,048,576, and each token gets mapped to the index corresponding to its hash value modulo the size of the feature vector. My best guess, given the information provided, is that some of the resulting vector indices are producing 0 probability estimates across all classes in the Naive Bayes training (which would explain an indeterminate value when taking logs).
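To make the hashing step concrete, here is a minimal standalone sketch against Spark MLlib's HashingTF; the tokens are made up, and the 15,000-feature size just mirrors the default I mention below:

import org.apache.spark.mllib.feature.HashingTF

// The no-argument constructor defaults to 2^20 = 1,048,576 features;
// here we use a much smaller feature space for illustration.
val tf = new HashingTF(numFeatures = 15000)

// Each token is hashed and reduced modulo numFeatures to get its index,
// so distinct tokens can collide into the same slot of the sparse vector.
val tokens = Seq("classify", "some", "text", "text")
val termFrequencies = tf.transform(tokens)

// The index a single term maps to.
val slot = tf.indexOf("text")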

In light of this, I just threw in a parameter numFeatures to PreparatorParams in a 2.3 release to control the size of your feature vectors (I set the default to 15000, although you can modify this as you wish in your engine.json file), and tested some queries out. Let me know if this fixes the problem for you; otherwise, please provide whatever extra information you can regarding your data and the queries that are producing these values.
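For reference, here is a rough sketch of what that parameter can look like on the Scala side, assuming PredictionIO's io.prediction.controller.Params trait; the exact class shape in the template may differ, so treat this as illustrative rather than the template's actual code:

import io.prediction.controller.Params

// Hypothetical preparator parameters: numFeatures is the knob discussed above
// (defaulting to 15000) and is what gets passed to HashingTF when preparing data.
case class PreparatorParams(numFeatures: Int = 15000) extends Params

In engine.json you would then override it under the preparator's params block, with something like "preparator": {"params": {"numFeatures": 20000}}.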

---- EDIT ----

Alright, so here is a little trick I'm proposing for avoiding these NaN values.

In particular, you can see from the documentation that the vector of posterior class probabilities (given the observed words in the document) is represented, on the log scale, by the vector:

$$ (c_1, \ldots, c_K), \qquad c_k = \log p(C_k) + \sum_i x_i \log p(w_i \mid C_k) $$

What this means is that the posterior probability of an observation being in class k given the word counts obtained from the document can be written as:

$$ p(C_k \mid \mathbf{x}) = \frac{e^{c_k}}{\sum_{j=1}^{K} e^{c_j}} $$

Now, we have the obvious equality:

$$ e^{c_k} = 10^{\,c_k \log_{10} e} $$

In particular, this means that we can write the latter probability as:

$$ p(C_k \mid \mathbf{x}) = \frac{10^{\,c_k \log_{10} e}}{\sum_{j=1}^{K} 10^{\,c_j \log_{10} e}} $$

So why does this all matter? Well, we are worried about the case where the c_k values in the probability computation are negative numbers with large absolute values, so that every e^{c_k} underflows to 0. Rescaling by the largest of the c_k constrains the largest of the resulting terms to be 1 and the rest to values less than 1. That is, if, without loss of generality, we assume that class 1 is associated with the c_1 of smallest absolute value (the maximum, since all the c_k are negative), then we have the equality:

$$ p(C_k \mid \mathbf{x}) = \frac{10^{\,(c_k - c_1)\log_{10} e}}{\sum_{j=1}^{K} 10^{\,(c_j - c_1)\log_{10} e}} = \frac{10^{\,(|c_1| - |c_k|)\log_{10} e}}{\sum_{j=1}^{K} 10^{\,(|c_1| - |c_j|)\log_{10} e}} $$

I think that the power of these equations is better illustrated by a code example:

import scala.math._

// Unnormalized log posterior values c_k (all negative with large absolute value).
val probs = Seq(-13452.0, -13255.0, -13345.0)

// Naive exponentiation underflows: this results in a sequence of 0.0 values.
probs
    .map(c => exp(c))

// Same transformation using the latter equalities: shift by the maximum
// (the c_k with the smallest absolute value) before exponentiating.
// Note that this yields non-zero values.
probs
    .map(c => (abs(probs.max) - abs(c)) * log10(E))
    .map(x => pow(10, x))
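To close the loop on the NaN itself (this follow-up snippet is my own addition, not part of the template): normalizing the rescaled values gives a proper probability distribution, whereas normalizing the raw exponentials divides 0 by 0, which is exactly the NaN reported in the question.

import scala.math._

val probs = Seq(-13452.0, -13255.0, -13345.0)

// Rescale relative to the largest (least negative) log value, then normalize.
val rescaled = probs.map(c => pow(10, (abs(probs.max) - abs(c)) * log10(E)))
val normalized = rescaled.map(_ / rescaled.sum)
// normalized sums to 1 and contains no NaN values; the predicted class is unchanged.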

Will implement this catch asap, thank you for the heads up.

Marco Vivero
  • I changed the code to use a vector of 15000; it didn't fix the problem. The getScores function returns a double[30] array filled with NaNs: innerProduct(x, y) returns very large negative numbers (like -12351), so their exponentials are zero, and then the normalize function tries to divide all the zeroes by zero. Is that what you mean by saying "the resulting vector indices are producing 0 probability estimates across all classes"? How can I debug further? – clemp6r Sep 11 '15 at 08:17
  • BTW I got pretty good results with Logistic Regression; the problem appears only with Naive Bayes. – clemp6r Sep 11 '15 at 16:14
  • 1
    Yes, that is essentially what I was getting at. Looks like some of the probability estimates for tokens conditioned on each class are close to zero, so when you go through and do the final multiplication (this translates to the inner product when you look at log probabilities) in the un-normalized class probability computation, these all also end up being close to 0 (which translates to negative numbers with large absolute values). Now the fact that the log probability estimates are bounded is great as this can be leveraged. Let me think about it a little more and I will get back to you! – Marco Vivero Sep 11 '15 at 19:29
  • FYI, using a vector of 15000 instead of the default seems to dramatically decrease the logistic regression training time; it also cuts the engine's memory footprint by a factor of 10. – clemp6r Sep 15 '15 at 09:01