Why does Spark ML NaiveBayes output labels that are different from the training data?

Question

I use the NaiveBayes classifier in Apache Spark ML (version 1.5.1) to predict some text categories. However, the classifier outputs labels that are different from the labels in my training set. Am I doing it wrong?

Here is a small example that can be pasted into, e.g., a Zeppelin notebook:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row

// Prepare training documents from a list of (id, text, label) tuples.
val training = sqlContext.createDataFrame(Seq(
  (0L, "X totally sucks :-(", 100.0),
  (1L, "Today was kind of meh", 200.0),
  (2L, "I'm so happy :-)", 300.0)
)).toDF("id", "text", "label")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and nb.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val nb = new NaiveBayes()

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, nb))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)

// Prepare test documents, which are unlabeled (id, text) tuples.
val test = sqlContext.createDataFrame(Seq(
  (4L, "roller coasters are fun :-)"),
  (5L, "i burned my bacon :-("),
  (6L, "the movie is kind of meh")
)).toDF("id", "text")

// Make predictions on test documents.
model.transform(test)
  .select("id", "text", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prediction: Double) =>
    println(s"($id, $text) --> prediction=$prediction")
  }

The output from the small program:

(4, roller coasters are fun :-)) --> prediction=2.0
(5, i burned my bacon :-() --> prediction=0.0
(6, the movie is kind of meh) --> prediction=1.0

The set of predicted labels {0.0, 1.0, 2.0} is disjoint from my training set labels {100.0, 200.0, 300.0}.

Question: How can I map these predicted labels back to my original training set labels?

Bonus question: why do the training set labels have to be doubles, when any other type would work just as well as a label? Seems unnecessary.

score 4 · Accepted Answer · answered Nov 25 '15 at 08:46

However, the classifier outputs labels that are different from the labels in my training set. Am I doing it wrong?

Kind of. As far as I can tell, you're hitting the issue described by SPARK-9137. Generally speaking, all classifiers in ML expect 0-based labels (0.0, 1.0, 2.0, ...), but there is no validation step in ml.NaiveBayes. Under the hood, the data is passed to mllib.NaiveBayes, which doesn't have this limitation, so the training process runs without errors.
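
Since nothing validates the labels for you, a quick check before fitting can catch the problem early. The following is a minimal sketch in plain Scala; `hasZeroBasedLabels` is a hypothetical helper, not part of Spark, and it assumes the requirement is that the distinct labels are exactly 0.0, 1.0, ..., k-1:

```scala
// Hypothetical helper (not a Spark API): returns true when the distinct
// labels are exactly 0.0, 1.0, ..., k-1 for some k.
def hasZeroBasedLabels(labels: Seq[Double]): Boolean = {
  val distinct = labels.distinct.sorted
  // After sorting, the i-th distinct label must be equal to i.
  distinct.zipWithIndex.forall { case (label, i) => label == i.toDouble }
}

// Labels from the question: not 0-based, so the check fails.
println(hasZeroBasedLabels(Seq(100.0, 200.0, 300.0)))

// Labels in the form the ML classifiers expect.
println(hasZeroBasedLabels(Seq(0.0, 1.0, 2.0)))
```

With a DataFrame, the labels could be collected first, for example with `training.select("label").collect().map(_.getDouble(0))`, which is fine for small data like the example in the question.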

When the model is converted back to ml, the prediction function simply assumes that the labels were correct and returns the predicted label using Vector.argmax, that is, the index of the highest-scoring class rather than the original label value; hence the results you get. You can fix the labels using, for example, StringIndexer.
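
A minimal sketch of the StringIndexer approach, reusing the data from the question. It assumes Spark 1.5 or later (where IndexToString and StringIndexerModel.labels are available) and a notebook-provided `sqlContext`; the column names `indexedLabel` and `predictedLabel` are made up for illustration. The indexer is fitted outside the pipeline so that the unlabeled test data never passes through it:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, IndexToString, StringIndexer, Tokenizer}

// Assumption: `sqlContext` is provided by the notebook, as in the question.
val training = sqlContext.createDataFrame(Seq(
  (0L, "X totally sucks :-(", 100.0),
  (1L, "Today was kind of meh", 200.0),
  (2L, "I'm so happy :-)", 300.0)
)).toDF("id", "text", "label")

// Map the original labels to 0-based indices. Numeric labels are cast to
// strings, so 100.0 becomes the label "100.0".
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(training)
val indexedTraining = labelIndexer.transform(training)

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
// Train on the 0-based index column instead of the raw label column.
val nb = new NaiveBayes()
  .setLabelCol("indexedLabel")

val model = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, nb))
  .fit(indexedTraining)

val test = sqlContext.createDataFrame(Seq(
  (4L, "roller coasters are fun :-)"),
  (5L, "i burned my bacon :-("),
  (6L, "the movie is kind of meh")
)).toDF("id", "text")

// Translate predicted indices back to the original labels (as strings).
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

val predictions = labelConverter.transform(model.transform(test))
predictions.select("id", "text", "predictedLabel").show()
```

The `predictedLabel` column holds strings such as "300.0"; if doubles are needed, the column can be cast back, for example with `predictions("predictedLabel").cast("double")`.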

why do the training set labels have to be doubles, when any other type would work just as well as a label?

I guess it is mostly a matter of keeping the API simple and reusable. This way LabeledPoint can be used for both classification and regression problems. Moreover, it is an efficient representation in terms of memory usage and computational cost.

zero323

I understand your explanation, thank you. From a usability point of view, however, the behaviour seems like a bug since the user should not be expected to know the internals of the API. But, like you point out, this has already been described in issue 9137. Thanks. – Pimin Konstantin Kefaloukos Nov 26 '15 at 09:45

I would even argue that forcing the user to pick double-typed labels in the range 0-n is unintuitive in the first place. Often the labels of the data are strings, like names. This forces the user to map those labels to doubles as preprocessing, which is boring boiler-plate code. – Pimin Konstantin Kefaloukos Nov 26 '15 at 09:50
Yep, it is confusing. What is worse, it can be a silent error if the labels overlap with the expected range. The question of label types is a little more complicated, and there are different issues like static typing, label hashability, and the cost of the encoding. – zero323 Nov 26 '15 at 18:34