
Is there a way to load Google's or GloVe's pre-trained vectors (models), such as GoogleNews-vectors-negative300.bin.gz, into Spark and perform operations such as findSynonyms that Spark provides? Or do I need to do the loading and the operations from scratch?

In the post Load Word2Vec model in Spark, Tom Lous suggests converting the bin file to txt and starting from there. I have already done that, but what comes next?

In a question I posted yesterday, I got an answer saying that models in Parquet format can be loaded into Spark, so I'm posting this question to make sure there is no other option.

Mike_Jr

1 Answer


Disclaimer: I'm pretty new to Spark, but the approach below at least works for me.

The trick is figuring out how to construct a Word2VecModel from a set of word vectors, and handling a few gotchas that come up when creating the model this way.

First, load your word vectors into a Map. For example, I have saved my word vectors in Parquet format (in a folder called "wordvectors.parquet"), where the "term" column holds the word as a String and the "vector" column holds the vector as an array<float>.
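If you are starting from the converted txt file mentioned in the question, here is a rough sketch of how such a Parquet file might be produced first. This part is my assumption rather than something from the original setup: I'm assuming one "word f1 f2 ... f300" line per entry, plus the usual "<vocabSize> <dim>" header line of a word2vec text dump:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

SparkSession pSpark = SparkSession.builder().appName("w2v-to-parquet").getOrCreate();

StructType schema = new StructType(new StructField[]{
        new StructField("term", DataTypes.StringType, false, Metadata.empty()),
        new StructField("vector", DataTypes.createArrayType(DataTypes.FloatType), false, Metadata.empty())
});

// Each data line is "<word> <f1> <f2> ... <f300>"; the first line of a word2vec
// text dump is usually a "<vocabSize> <dim>" header, so drop short lines.
Dataset<Row> vectors = pSpark.read().textFile("GoogleNews-vectors-negative300.txt")
        .filter((FilterFunction<String>) line -> line.split(" ").length > 2)
        .map((MapFunction<String, Row>) line -> {
            String[] parts = line.split(" ");
            List<Float> vec = Arrays.stream(parts, 1, parts.length)
                    .map(Float::parseFloat)
                    .collect(Collectors.toList());
            return RowFactory.create(parts[0], vec);
        }, RowEncoder.apply(schema));

vectors.write().parquet("wordvectors.parquet");

With the Parquet file in place, I can load it like so in Java: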

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import com.google.common.primitives.Floats;

// Load the dataset: the "term" column holds the word and the "vector" column
// holds the vector as an array<float>; pSpark is the SparkSession.
Dataset<Row> vectorModel = pSpark.read().parquet("wordvectors.parquet");

// Collect the dataset into a map from word to vector. (Dataset.collect()
// returns Object to Java code, hence the Row[] cast.)
Map<String, List<Float>> vectorMap = Arrays.stream((Row[]) vectorModel.collect())
        .collect(Collectors.toMap(row -> row.getAs("term"), row -> row.getList(1)));

// Convert to the format Word2VecModel expects: float[] rather than List<Float>.
Map<String, float[]> word2vecMap = vectorMap.entrySet().stream()
        .collect(Collectors.toMap(Map.Entry::getKey, entry -> Floats.toArray(entry.getValue())));

// Word2VecModel needs a Scala immutable map, so convert it (helper below).
scala.collection.immutable.Map<String, float[]> scalaMap = toScalaImmutableMap(word2vecMap);

import scala.Tuple2;
import scala.collection.JavaConverters;
import scala.collection.Seq;

private static <K, V> scala.collection.immutable.Map<K, V> toScalaImmutableMap(Map<K, V> pFromMap) {
    // Turn the Java map entries into a Scala Seq of Tuple2s...
    final List<Tuple2<K, V>> list = pFromMap.entrySet().stream()
            .map(e -> Tuple2.apply(e.getKey(), e.getValue()))
            .collect(Collectors.toList());

    Seq<Tuple2<K, V>> scalaSeq = JavaConverters.asScalaBufferConverter(list).asScala().toSeq();

    // ...and build an immutable Scala Map from that Seq.
    return (scala.collection.immutable.Map<K, V>) scala.collection.immutable.Map$.MODULE$.apply(scalaSeq);
}

Now you can construct the model from scratch. Due to a quirk in how Word2VecModel works, you must set the vector size manually, and in a slightly roundabout way; otherwise it defaults to 100 and you get an error when you invoke .transform(). Here is a way I've found that works, though I'm not sure everything in it is necessary:

import org.apache.spark.ml.feature.Word2Vec;
import org.apache.spark.ml.feature.Word2VecModel;

// The parent Word2Vec is not used for fitting; it is only used to set the
// vector-size param (not sure if this is needed or if result.set is enough).
Word2Vec parent = new Word2Vec();
parent.setVectorSize(300);

// Wrap the word-vector map in an mllib Word2VecModel and lift it into the ml API.
Word2VecModel result = new Word2VecModel("w2vmodel",
        new org.apache.spark.mllib.feature.Word2VecModel(scalaMap)).setParent(parent);
result.set(result.vectorSize(), 300);

Now you should be able to use result.transform() like you would with a self-trained model.
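For instance, here is a minimal usage sketch, assuming the pSpark session and the result model from above (the "text" and "features" column names are just my choice):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// A one-row test DataFrame whose "text" column is an array of tokens.
StructType schema = new StructType(new StructField[]{
        new StructField("text", DataTypes.createArrayType(DataTypes.StringType), false, Metadata.empty())
});
List<Row> data = Arrays.asList(RowFactory.create(Arrays.asList("hello", "world")));
Dataset<Row> docs = pSpark.createDataFrame(data, schema);

// transform() averages the word vectors of the tokens in each row into a single vector column.
result.setInputCol("text").setOutputCol("features");
result.transform(docs).show(false);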

I haven't tested other Word2VecModel functions to see if they work correctly; I've only tested .transform().
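That said, the ml Word2VecModel does expose findSynonyms (which the question asks about), so it may be worth trying against a model built this way; treat this as an untested sketch:

// Untested on a model constructed as above: findSynonyms(word, num) returns a
// DataFrame of ("word", "similarity") rows for the num nearest neighbours.
result.findSynonyms("queen", 10).show(false);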

chairbender