Hmm, so my inkling here is that the issue is in the NBAlgorithm.predict
method, and it occurs for the following reason. If you look at PreparedData,
the frequency vector is being created using Spark MLlib's HashingTF
class. The default size of this vector is 1,048,576 (2^20), and each token gets mapped to the index corresponding to its hash value modulo the size of the feature vector. My best guess, given the information provided, is that some of the resulting vector indices are producing 0 probability estimates across all classes in the Naive Bayes training, which would explain an indeterminate value when taking logs.
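To make the collision behavior concrete, here is a minimal sketch of the hashing step, assuming the RDD-based MLlib API (the variable names are mine, not the template's):

import org.apache.spark.mllib.feature.HashingTF

// The default HashingTF hashes into 2^20 = 1,048,576 buckets; each token
// is mapped to its hash value modulo the number of features.
val tfDefault = new HashingTF()       // 1,048,576 features
val tfSmall   = new HashingTF(15000)  // a much smaller feature space

val tokens = Seq("spark", "naive", "bayes", "spark")
// Term-frequency vectors: the smaller space forces more hash collisions,
// but leaves far fewer indices that are zero across every training document.
val vDefault = tfDefault.transform(tokens)
val vSmall   = tfSmall.transform(tokens)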
In light of this, I just threw in a parameter numFeatures
to PreparatorParams
in the 2.3 release to control the size of your feature vectors (the default is set to 15000, although you can modify this as you wish in your engine.json
file), and tested some queries out. Let me know if this fixes the problem for you; otherwise, please provide whatever extra information you can about your data and the queries that are producing these values.
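For reference, the override in engine.json would look something along these lines (a hypothetical excerpt; the surrounding fields follow whatever layout your copy of the template already uses, and numFeatures is the only new parameter):

"preparator": {
  "params": {
    "numFeatures": 15000
  }
}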
---- EDIT ----
Alright, so here is a little trick I'm proposing for avoiding these NaN
values.
In particular, you can see from the MLlib documentation that the vector of posterior class probabilities (given the observed words in the document) is represented, up to normalization, by the vector:

\[ (c_1, \dots, c_K), \qquad c_k = \log \pi_k + \sum_i x_i \log \theta_{k,i}, \]

where \(\pi_k\) is the prior probability of class \(k\), \(\theta_{k,i}\) is the conditional probability of token \(i\) given class \(k\), and \(x_i\) is the frequency of token \(i\) in the document.
What this means is that the posterior probability of an observation being in class \(k\), given the word counts obtained from the document, can be written as:

\[ P(k \mid x) = \frac{e^{c_k}}{\sum_{j=1}^{K} e^{c_j}}. \]
Now, we have the obvious equality:

\[ e^{c_k} = 10^{\, c_k \log_{10} e}, \]
which in particular means that we can write the latter probability as:

\[ P(k \mid x) = \frac{10^{\, c_k \log_{10} e}}{\sum_{j=1}^{K} 10^{\, c_j \log_{10} e}}. \]
So why does this all matter? Well, we are worried about the case when the c_k values in the probability computation are negative numbers with large absolute values, so that every \(e^{c_k}\) underflows to 0 and the quotient above evaluates to 0/0, i.e. NaN. Since multiplying the numerator and denominator by a common factor changes nothing, we can divide both by the largest term: this constrains the largest of these to be 1, and the rest to be values less than 1. That is, if, without loss of generality, we assume that class 1 is associated with the \(c_1\) of smallest absolute value (equivalently, \(c_1 = \max_k c_k\), since all the \(c_k\) are negative), then we have the equality:

\[ P(k \mid x) = \frac{10^{\,(c_k - c_1) \log_{10} e}}{\sum_{j=1}^{K} 10^{\,(c_j - c_1) \log_{10} e}} = \frac{10^{\,(|c_1| - |c_k|) \log_{10} e}}{\sum_{j=1}^{K} 10^{\,(|c_1| - |c_j|) \log_{10} e}}. \]
I think that the power of these equations is better illustrated by a code example:
import scala.math._

val probs = Seq(-13452.0, -13255.0, -13345.0)

// Naive exponentiation underflows: results in a sequence of 0.0 values.
probs.map(c => exp(c))

// Same transformation using the latter equalities. Note that this yields
// non-zero values. (abs(probs.max) - abs(c) equals c - probs.max here,
// since every value is negative.)
probs
  .map(c => (abs(probs.max) - abs(c)) * log10(E))
  .map(x => pow(10, x))
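For reference, here is a rough sketch of how this catch might look when computing prediction scores. The getScores helper is hypothetical, not the template's actual code; the only things assumed from MLlib are that NaiveBayesModel exposes pi and theta, which already store the log-priors and log-conditional-probabilities:

import org.apache.spark.mllib.classification.NaiveBayesModel
import org.apache.spark.mllib.linalg.Vector
import scala.math._

// Hypothetical helper: normalized posterior class probabilities from a
// trained model and a term-frequency vector, using the max-shift trick.
def getScores(nb: NaiveBayesModel, x: Vector): Array[Double] = {
  // c_k = log(pi_k) + sum_i x_i * log(theta_{k,i})
  val c: Array[Double] = nb.pi.zip(nb.theta).map { case (logPrior, logTheta) =>
    logPrior + logTheta.zip(x.toArray).map { case (t, xi) => t * xi }.sum
  }
  // Shift by the maximum before exponentiating, so the largest term is 1
  // and the sum in the denominator can never underflow to 0.
  val cMax = c.max
  val unnormalized = c.map(ck => pow(10, (ck - cMax) * log10(E)))
  val total = unnormalized.sum
  unnormalized.map(_ / total)
}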
Will implement this catch asap, thank you for the heads up.