
I'm going to train a naive Bayes classifier on a set of training documents using Apache Spark (or Mahout on Hadoop). I'd like to use this model when I receive new documents to classify. Is there any way to store the model once it is trained and then load it in another Spark job later?


2 Answers


Yes, see the Spark MLlib naive Bayes documentation.

import org.apache.spark.mllib.classification.NaiveBayesModel

model.save(sc, "myModelPath")                            // persist the trained model
val sameModel = NaiveBayesModel.load(sc, "myModelPath")  // reload it in a later job
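
For context, here is a minimal two-job sketch of that workflow (train and save in one application, load and predict in another). The paths, app names, and toy feature vectors are illustrative assumptions, not part of the answer above:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Job 1: train on (already vectorized) documents and persist the model.
object TrainNaiveBayesJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("train-nb"))
    val training = sc.parallelize(Seq(              // stand-in for real TF-IDF features
      LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
      LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 1.0))))
    val model = NaiveBayes.train(training, lambda = 1.0)
    model.save(sc, "hdfs:///models/naive-bayes")    // hypothetical path
    sc.stop()
  }
}

// Job 2: load the persisted model and classify a new document.
object ClassifyNaiveBayesJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("classify-nb"))
    val model = NaiveBayesModel.load(sc, "hdfs:///models/naive-bayes")
    val prediction = model.predict(Vectors.dense(0.0, 1.0, 0.0))
    println(s"predicted label: $prediction")
    sc.stop()
  }
}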

In Mahout's MapReduce-backed naive Bayes, the model will be saved to the directory specified by the -o parameter if training is done via the CLI:

mahout trainnb
  -i ${PATH_TO_TFIDF_VECTORS} 
  -o ${"path/to/model}/model 
  -li ${PATH_TO_MODEL}/labelindex 
  -ow 
  -c

See: http://mahout.apache.org/users/classification/bayesian.html

And retrieved via:

NaiveBayesModel model = NaiveBayesModel.materialize(new Path("/path/to/model"), getConf());
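
To classify a new document against the materialized model in application code, something along these lines should work (a hedged sketch, not from the answer above; the vector cardinality is a placeholder, and the document vector must come from the same TF-IDF pipeline used for training):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.mahout.classifier.naivebayes.{NaiveBayesModel, StandardNaiveBayesClassifier}
import org.apache.mahout.math.{RandomAccessSparseVector, Vector}

object ClassifyWithMahoutModel {
  def main(args: Array[String]): Unit = {
    val conf       = new Configuration()
    val model      = NaiveBayesModel.materialize(new Path("/path/to/model"), conf)
    val classifier = new StandardNaiveBayesClassifier(model)

    // Placeholder document vector; cardinality must match the training TF-IDF vectors.
    val docVector: Vector = new RandomAccessSparseVector(10000)
    val scores = classifier.classifyFull(docVector)  // one score per label
    val bestLabelIndex = scores.maxValueIndex()      // look this index up in the labelindex
    println(s"best label index: $bestLabelIndex")
  }
}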

Alternatively, using Mahout-Samsara's Spark-backed naive Bayes, a model can be trained from the command line and will similarly be output to the path specified by the -o parameter:

mahout spark-trainnb
  -i ${PATH_TO_TFIDF_VECTORS} 
  -o ${PATH_TO_MODEL}
  -ow 
  -c

or a model can be trained from within an application via:

val model = SparkNaiveBayes.train(aggregatedObservations, labelIndex, false)

The model can then be written to (HD)FS via:

model.dfsWrite("/path/to/model")

and retrieved via:

val retrievedModel = NBModel.dfsRead("/path/to/model")
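
As a hedged sketch of how these dfsWrite/dfsRead calls fit into a standalone application: they take an implicit Mahout DistributedContext, which with the Spark bindings can be created via mahoutSparkContext. The master URL, application name, and model path below are illustrative assumptions:

import org.apache.mahout.classifier.naivebayes.NBModel
import org.apache.mahout.sparkbindings._

object ReloadSamsaraModel {
  def main(args: Array[String]): Unit = {
    // Spark-backed Mahout context; dfsRead/dfsWrite pick it up implicitly.
    implicit val ctx = mahoutSparkContext("local[*]", "reload-nb-model")

    val retrievedModel = NBModel.dfsRead("/path/to/model")
    println(s"model loaded: ${retrievedModel.getClass.getSimpleName}")
  }
}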

See: http://mahout.apache.org/users/environment/classify-a-doc-from-the-shell.html