Spark Version: 1.6.1

I have recently refactored our Word2Vec code to move to the DataFrame-based ML models, but I am having problems serializing and loading the model locally.

I am able to successfully:

  1. Fit the dataframe and create the model.
  2. Retrieve synonyms.

When I try to serialize the model locally, the vectors are not serialized, so the resulting file is far too small: approximately 2 KB for a model trained on 10 GB of data.

        // Standard Java serialization of the fitted model
        try (FileOutputStream fo = new FileOutputStream("/tmp/word2vec");
             ObjectOutputStream so = new ObjectOutputStream(fo)) {
            so.writeObject(word2VecModel);
            so.flush();
        }
        logger.info("Word2Vec model saved");

Loading the model and calling findSynonyms() results in the exception below:

        java.lang.NullPointerException
            at org.apache.spark.ml.feature.Word2VecModel.transform(Word2Vec.scala:224)

Is there a way to save the model locally?

skgemini

1 Answer

Have you tried the model persistence functionality that is now included out of the box? You can save either a single model or an entire pipeline. I tried it and it worked.
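A minimal sketch of what this looks like, assuming the spark.ml `MLWritable`/`MLReadable` API (`model.write().save(...)` and `Word2VecModel.load(...)`). Note that built-in read/write support for `Word2VecModel` may require a Spark version newer than 1.6.1, and the path below is just an example:

```java
import java.io.IOException;
import org.apache.spark.ml.feature.Word2VecModel;

public class Word2VecPersistence {

    // Save the fitted model using Spark's own persistence instead of
    // Java serialization; overwrite() avoids "path already exists"
    // errors on re-runs. The path is a hypothetical example.
    public static void saveModel(Word2VecModel model, String path) throws IOException {
        model.write().overwrite().save(path);
    }

    // Reload the model; findSynonyms()/transform() work on the loaded
    // copy because the word vectors are persisted along with the params.
    public static Word2VecModel loadModel(String path) {
        return Word2VecModel.load(path);
    }
}
```

If you fitted the model inside a `Pipeline`, you can save the whole `PipelineModel` the same way and reload it with `PipelineModel.load(path)`.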