
I'm going to train a naive Bayes classifier on a set of training documents using Apache Spark (or Mahout on Hadoop). I'd like to use this model when I receive new documents to classify. Is there any way to store the model once it is trained and then load it in another Spark job later?


2 Answers


Yes, see the Spark MLlib naive Bayes documentation.

import org.apache.spark.mllib.classification.NaiveBayesModel

model.save(sc, "myModelPath")                            // persist the trained model
val sameModel = NaiveBayesModel.load(sc, "myModelPath")  // reload it in a later job
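
For context, here is a minimal two-job sketch of that workflow (train and save in one application, load and predict in another). The paths, app names, and toy feature vectors are illustrative assumptions, not part of the answer above:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Job 1: train on (already vectorized) documents and persist the model.
object TrainNaiveBayesJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("train-nb"))
    val training = sc.parallelize(Seq(              // stand-in for real TF-IDF features
      LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
      LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 1.0))))
    val model = NaiveBayes.train(training, lambda = 1.0)
    model.save(sc, "hdfs:///models/naive-bayes")    // hypothetical path
    sc.stop()
  }
}

// Job 2: load the persisted model and classify a new document.
object ClassifyNaiveBayesJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("classify-nb"))
    val model = NaiveBayesModel.load(sc, "hdfs:///models/naive-bayes")
    val prediction = model.predict(Vectors.dense(0.0, 1.0, 0.0))
    println(s"predicted label: $prediction")
    sc.stop()
  }
}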

In Mahout's MapReduce-backed naive Bayes, the model will be saved to the directory specified by the -o parameter if training is done via the CLI:

mahout trainnb
  -i ${PATH_TO_TFIDF_VECTORS} 
  -o ${"path/to/model}/model 
  -li ${PATH_TO_MODEL}/labelindex 
  -ow 
  -c

See: http://mahout.apache.org/users/classification/bayesian.html

And retrieved via:

NaiveBayesModel model = NaiveBayesModel.materialize(new Path("/path/to/model"), getConf());
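
To classify a new document against the materialized model in application code, something along these lines should work (a hedged sketch, not from the answer above; the vector cardinality is a placeholder, and the document vector must come from the same TF-IDF pipeline used for training):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.mahout.classifier.naivebayes.{NaiveBayesModel, StandardNaiveBayesClassifier}
import org.apache.mahout.math.{RandomAccessSparseVector, Vector}

object ClassifyWithMahoutModel {
  def main(args: Array[String]): Unit = {
    val conf       = new Configuration()
    val model      = NaiveBayesModel.materialize(new Path("/path/to/model"), conf)
    val classifier = new StandardNaiveBayesClassifier(model)

    // Placeholder document vector; cardinality must match the training TF-IDF vectors.
    val docVector: Vector = new RandomAccessSparseVector(10000)
    val scores = classifier.classifyFull(docVector)  // one score per label
    val bestLabelIndex = scores.maxValueIndex()      // look this index up in the labelindex
    println(s"best label index: $bestLabelIndex")
  }
}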

Alternatively, using Mahout-Samsara's Spark-backed naive Bayes, a model can be trained from the command line and will similarly be output to the path specified by the -o parameter:

mahout spark-trainnb
  -i ${PATH_TO_TFIDF_VECTORS} 
  -o ${PATH_TO_MODEL}
  -ow 
  -c

or a model can be trained from within an application via:

val model = SparkNaiveBayes.train(aggregatedObservations, labelIndex, false)

The model can then be written to (HD)FS via:

model.dfsWrite("/path/to/model")

and retrieved via:

val retrievedModel = NBModel.dfsRead("/path/to/model")
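
As a hedged sketch of how these dfsWrite/dfsRead calls fit into a standalone application: they take an implicit Mahout DistributedContext, which with the Spark bindings can be created via mahoutSparkContext. The master URL, application name, and model path below are illustrative assumptions:

import org.apache.mahout.classifier.naivebayes.NBModel
import org.apache.mahout.sparkbindings._

object ReloadSamsaraModel {
  def main(args: Array[String]): Unit = {
    // Spark-backed Mahout context; dfsRead/dfsWrite pick it up implicitly.
    implicit val ctx = mahoutSparkContext("local[*]", "reload-nb-model")

    val retrievedModel = NBModel.dfsRead("/path/to/model")
    println(s"model loaded: ${retrievedModel.getClass.getSimpleName}")
  }
}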

See: http://mahout.apache.org/users/environment/classify-a-doc-from-the-shell.html