I'm going to train a naive bayes classifier on a bunch of training document using Apache Spark (or Mahout in Hahoop). I'd like to use this model when I receive new documents to classify. I wonder to know whether there is any possibility to store the model when is trained and then load it in another Spark job later?
Asked
Active
Viewed 311 times
2 Answers
1
Yes, see the Spark mllib naives bayes documentation.
model.save(sc, "myModelPath")
val sameModel = NaiveBayesModel.load(sc, "myModelPath")

dpeacock
- 2,697
- 13
- 16
1
In Mahout's MapReduce backed NaiveBayes, the Model will be saved to the directory specified by the -o
parameter if training is via the CLI:
mahout trainnb
-i ${PATH_TO_TFIDF_VECTORS}
-o ${"path/to/model}/model
-li ${PATH_TO_MODEL}/labelindex
-ow
-c
See: http://mahout.apache.org/users/classification/bayesian.html
And retrieved via:
NaiveBayesModel model = NaiveBayesModel.materialize(("/path/to/model"), getConf());
Alternatively, using Mahout-Samsara's Spark backed Naive Bayes, a model can be trained from the command line and will be similarly be output to the path specified by the -o
parameter:
mahout spark-trainnb
-i ${PATH_TO_TFIDF_VECTORS}
-o ${/path/to/model}
-ow
-c
or the a model can be trained from within an application via:
val model = SparkNaiveBayes.train(aggregatedObservations, labelIndex, false)
Output to (HD)FS by:
model.dfsWrite("/path/to/model")
and retrieved via:
val retrievedModel = NBModel.dfsRead("/path/to/model")
See: http://mahout.apache.org/users/environment/classify-a-doc-from-the-shell.html

Andrew Palumbo
- 119
- 4